Metascience
                    
There is increasing concern about reproducibility in many scientific fields and about the extent to which statistically significant published research findings are credible. Recent large-scale replication projects in the quantitative social sciences have found that only about 50% of the original studies replicate. In this project, we continue and expand our work on assessing and improving reproducibility in the social sciences. The project will consist of two studies. In the first study, we will introduce a new tool, decision markets combined with selective replications, for systematically collecting information about reproducibility and steering replication resources to the studies that are most likely to be false. In the second study, we will test whether so-called placebo tests in economics are reported selectively in a form of "reverse p-hacking". To carry out this second study, we will collect data for all papers reporting placebo tests in a number of top journals in economics.
Final report
Our project application included two projects (“Using decision markets to select which studies to replicate” and “Testing for selective reporting in economics: placebo tests”). Both projects have been carried out as planned without any major deviations from the plan in the grant proposal. 
Project 1: Using decision markets to select which studies to replicate:
For the first project we pre-registered a detailed analysis plan at Open Science Framework prior to starting the data collection. We also pre-registered an analysis plan for each of the 41 potential replications (see below) at Open Science Framework after obtaining feedback from the original authors. We thereafter implemented decision markets to select which studies to replicate among 41 online social science experiments. In total, 162 social scientists participated in the decision markets; before participating, they also filled out a survey about their replication beliefs for each of the 41 studies. The 41 included papers were all social science experiments published in PNAS between 2015 and 2018 that fulfilled our inclusion criteria for: (i) the platform on which the experiment was performed (MTurk), (ii) the type of design (between-subjects or within-subject treatment design), (iii) the equipment and materials needed to implement the experiment, and (iv) the results reported in the experiment (that there was at least one statistically significant p<0.05 main or interaction effect). The decision market prices can be interpreted as the market participants' estimated probability of replication. The 12 studies with the highest market prices, the 12 studies with the lowest market prices, and 2 randomly selected studies out of the remaining 17 studies were selected for replication. All replications were high powered, with 90% statistical power to detect 2/3 of the effect size reported in the original study at the 5% significance level.
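The selection rule described above (top 12, bottom 12, and 2 random draws from the remaining 17) can be sketched in a few lines of Python. This is only an illustration: the function name, the study labels, the random seed and the market prices are all hypothetical and are not taken from the project data.

```python
import random


def select_for_replication(prices, n_top=12, n_bottom=12, n_random=2, seed=0):
    """Pick the highest- and lowest-priced studies plus a random draw from the rest.

    `prices` maps a study label to its final market price (hypothetical values).
    """
    # Rank studies from highest to lowest market price.
    ranked = sorted(prices, key=prices.get, reverse=True)
    top = ranked[:n_top]
    bottom = ranked[-n_bottom:]
    # Everything that is neither in the top nor in the bottom group.
    middle = ranked[n_top:len(ranked) - n_bottom]
    rng = random.Random(seed)
    drawn = rng.sample(middle, n_random)
    return top, bottom, drawn


# Hypothetical prices for 41 studies (NOT the real market prices).
_rng = random.Random(1)
prices = {f"study_{i:02d}": round(_rng.uniform(0.05, 0.95), 3) for i in range(1, 42)}

top, bottom, drawn = select_for_replication(prices)
print(len(top), len(bottom), len(drawn))
```

With 41 studies this leaves 17 studies in the middle group, from which the 2 random replications are drawn, for 26 replications in total.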
The average replication sample size of about n=1,018 was about 3.5 times as large as the average original study sample size of n=292. All replications were carried out online at Amazon Mechanical Turk as in the original studies using the same experimental design, materials and analysis as in the original papers. The replication rate, based on the statistical significance indicator, was 83% for the top-12 and 33% for the bottom-12 group and the correlation between the decision market prices and the replication outcomes was 0.505. Overall, 54% of the studies were successfully replicated, with replication effect size estimates averaging 45% of the original effect size estimates. In conclusion, decision markets show potential as a tool for selecting studies for replications, but further work is needed to draw strong conclusions. The observed replication rate of social science experiments based on data collections via MTurk published in PNAS is comparable to previous systematic replication projects of experimental studies in the social sciences, primarily based on lab experiments.
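To illustrate why powering a replication for 2/3 of the original effect size leads to samples several times larger than the original, the following sketch uses the standard normal approximation for a two-group comparison of means. It is an assumption-laden example, not the power analysis used in the project: the function name and the original effect size of d = 0.3 are hypothetical, and the actual replications followed each original study's own design and test.

```python
import math
from statistics import NormalDist


def n_per_group(d_original, fraction=2 / 3, power=0.90, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample test.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the standardized effect size the replication should detect.
    """
    d_target = fraction * d_original  # e.g. 2/3 of the original effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d_target) ** 2)


# Hypothetical original effect size (Cohen's d); not taken from any included study.
n = n_per_group(0.3)
print(n, "participants per group")
```

Because the required sample size scales with 1/d^2, targeting 2/3 of the original effect multiplies the sample needed to detect the full original effect by (3/2)^2 = 2.25 at the same power.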
Project 2: Testing for selective reporting in economics: placebo tests:
In observational data studies trying to estimate causal effects in economics, typically using instrumental variables, difference-in-differences, or regression discontinuity methods, it has become standard to carry out so-called placebo tests where the main hypothesis test is carried out on a time period or situation where the estimated effect is expected to be zero (i.e. the null hypothesis is expected to be true). A failure to reject the null hypothesis in the placebo test is interpreted as supporting the validity of the research design to identify causal effects, and researchers therefore have an incentive to selectively underreport statistically significant placebo tests (a form of “reverse p-hacking”). We developed an algorithm to search for papers reporting placebo tests. We first applied the algorithm to Economic Journal as a pilot study (not included in any of our hypothesis tests), and based on the pilot study we posted a pre-analysis plan with inclusion/exclusion criteria and our exact tests and hypotheses at Open Science Framework. After posting the pre-analysis plan, the algorithm was applied to 11 other top journals in economics (American Economic Journal: Applied Economics; American Economic Journal: Economic Policy; American Economic Review; Econometrica; Journal of Development Economics; Journal of Labor Economics; Journal of Political Economy; Journal of the European Economic Association; Review of Economics and Statistics; Review of Economic Studies; Quarterly Journal of Economics). The algorithm identified 540 papers published between 2009 and 2021 for potential inclusion. These were then manually searched for placebo tests, and 377 of them met our inclusion criteria.
  
If the null hypothesis is true in all placebo tests, 2.5% of them should be statistically significant at the 5% level with an effect in the same direction as the main result of the paper (and 5% in total irrespective of the direction of the effect). The actual fraction of statistically significant placebo tests with an effect in the same direction was 1.29% (95% confidence interval [0.83, 1.63]), which is statistically significantly lower than the 2.5% benchmark (this test was our pre-registered primary hypothesis test as the incentives to underreport statistically significant placebo tests with an effect in the opposite direction of the main findings may be less strong). The overall fraction of statistically significant placebo tests was 3.10% (95% confidence interval [2.2, 4.0]), which is statistically significantly below the 5% benchmark (this was a pre-registered secondary hypothesis test). Our results provide strong evidence of selective underreporting of statistically significant placebo tests in top economics journals. It should be noted that our tests are conservative as the benchmark we test against is that the null hypothesis is true in all placebo tests, which is highly unlikely. The estimated selective underreporting can thus be viewed as a lower bound of the selective underreporting.
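The benchmark logic can be sketched as a one-sided exact binomial test: under the null, each placebo test has a 2.5% chance of being significant in the same direction as the main result, so an observed share well below 2.5% points to underreporting. The counts below are hypothetical, since the number of placebo tests is not stated here, and the sketch treats tests as independent, which the pre-registered analysis may not have done (tests are clustered within papers).

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Hypothetical counts (NOT the project data): 1,000 placebo tests,
# 13 of them significant in the same direction as the main result (1.3%).
n_tests = 1000
n_significant_same_direction = 13
benchmark = 0.025  # half of the 5% two-sided significance level

# One-sided p-value: probability of seeing this few significant tests or fewer
# if the null hypothesis were true in every placebo test.
p_value = binom_cdf(n_significant_same_direction, n_tests, benchmark)
print(f"observed share = {n_significant_same_direction / n_tests:.2%}, "
      f"one-sided p = {p_value:.4f}")
```

The same function applies to the secondary test by counting significant results in either direction and setting the benchmark to 0.05.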
The three most important results in the project:
We find suggestive evidence that decision markets can be a useful tool for selecting studies for replication; we find that the replication rate of online social science experiments published in PNAS is about 50% and similar to the replication rate for lab experiments found in previous systematic replication projects; and we find evidence of selective underreporting of statistically significant placebo tests in articles published in top economics journals.
Collaborations and dissemination of research results:
The first project involved a large-scale international collaborative project led by us involving researchers from Amsterdam University, CalTech, Harvard University, Massey University in Auckland, National University of Singapore, University of Innsbruck, University of Virginia and Wharton. The results of the two sub-projects have been communicated in two scientific articles published as open access.