Nicholas Otis | Policy Choice and the Wisdom of Crowds
ROSSA O'KEEFFE-O'DONOVAN: (00:03) So next up, we're delighted to have Nick Otis. He's an economics PhD student at UC Berkeley and he's speaking about Policy Choice and the Wisdom of Crowds. And he's happy to take clarification questions throughout and we'll have about five minutes at the end for more substantive questions. Okay. Thanks so much. Over to you Nick.
NICHOLAS OTIS: (00:19) Great. Okay. Thank you all for being here today. I'm Nick Otis. I'm a PhD student at UC Berkeley and today I'm going to be talking about a very general policy problem, which is deciding which intervention to evaluate or scale. So imagine that you're a policymaker and you're trying to select an intervention to increase vaccination. There's a long list of feasible policy interventions that you could select from and it's not obvious ex ante which of these interventions is going to be most effective. In this paper, I test whether crowds of academic experts can predict which policies are going to be more effective. If these predictions are accurate, we could use them to prioritize interventions for testing in randomized experiments, or as a standalone policy choice mechanism in circumstances where randomized experimentation is impossible. I'm going to test the ability of crowds to rank pairs of policies using data that combines over 9,000 predictions of the results of randomized controlled trials with gold standard causal evidence. I'll use this data to answer two questions. First, how well can individual experts predict which of two policies will have a larger causal effect? And second, what are the returns from asking additional forecasters, pooling their predictions together, and using this crowd forecast to rank pairs of policies?
(01:58) The talk is going to be broken up into two parts. In the first part, I'll go through an example looking at just two policies in an experiment designed to increase vaccination in Sweden, and in the second part of the talk, I'll present a more general set of results, which will look at 161 of these policy comparisons and over 9,000 forecasts from 863 forecasters, spanning seven large randomized experiments.
(02:27) So in this example study, which is now published in “Science”, the authors are testing a bunch of interventions designed to increase COVID-19 vaccination. We're going to focus on just two of these interventions: a small financial incentive of around US$24, conditional on receiving the COVID-19 vaccine, and information about the safety and efficacy of the COVID-19 vaccine. Both of these interventions are evaluated in this large experiment, both are compared to a control condition, and here are the results of the trial. On your left, we have the effects of the information intervention, which leads to a 0.7 percentage point increase in vaccination, and this is non-significant. On the right, we have the financial incentives intervention, which leads to a large boost in vaccination of just over 4 percentage points. So results from this large randomized controlled trial indicate that financial incentives are more effective than information at increasing vaccination.
(03:42) Next, we're going to look at whether academics are able to anticipate these results. Can experts predict that the financial incentive intervention will be more effective than the information intervention at increasing vaccination? I'm going to use predictions from 52 experts who provided their responses before the results of this study were made public. These predictions were made independently by experts through a simple email survey, and as with the other six studies, these experts are primarily economics faculty and PhD students. So we're trying to see if these experts forecast that the financial incentive intervention will be more effective than the information intervention.
(04:29) We're going to evaluate the performance of crowd predictions using the following procedure. First, we take the full sample of 52 experts, then we bootstrap-sample a crowd of C forecasters from that full pool. So we're sampling forecasters with replacement from that group of 52 experts. Then for each bootstrapped sample of experts, we calculate that crowd's average predicted effect for the information intervention and for the financial incentives intervention. And we say that that crowd has made the correct policy choice if its average prediction is larger for the financial incentives intervention than for the information intervention. So these crowds are essentially making a binary choice between the two interventions. We repeat this entire procedure thousands of times for crowds of size 1 to 30, and the outcome is the percent of crowds that correctly predict that the financial incentive intervention will be more effective. So we take a sample of experts, we calculate their average predicted effect for the financial incentive intervention and for the information intervention, and then we see whether that crowd prediction correctly ranks the two policies.
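As a rough illustration of the bootstrap procedure just described (a sketch only, not the paper's replication code), here is a minimal Python example. The expert predictions are simulated, and the function name `crowd_accuracy` is a hypothetical label for the ranking exercise.

```python
# Minimal sketch of the crowd-evaluation procedure described above. Assumptions:
# forecasts are stored as two NumPy arrays of predicted treatment effects (in
# percentage points), one entry per expert; the data here are simulated.
import numpy as np

rng = np.random.default_rng(0)

def crowd_accuracy(pred_incentive, pred_info, crowd_size, n_draws=10_000):
    """Share of bootstrapped crowds whose average prediction ranks the
    financial-incentive intervention above the information intervention."""
    n_experts = len(pred_incentive)
    correct = 0
    for _ in range(n_draws):
        # Draw a crowd of `crowd_size` experts with replacement.
        idx = rng.integers(0, n_experts, size=crowd_size)
        # The crowd "chooses" the policy with the larger average predicted effect.
        if pred_incentive[idx].mean() > pred_info[idx].mean():
            correct += 1
    return correct / n_draws

# Simulated beliefs for 52 experts (illustrative only, not the real survey data).
pred_incentive = rng.normal(3.0, 2.5, size=52)
pred_info = rng.normal(1.0, 2.5, size=52)
for c in (1, 10, 30):
    print(c, crowd_accuracy(pred_incentive, pred_info, c))
```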
(05:47) Here are the results from individual experts, which you can think of as crowds of size 1. These experts, represented here by the diamond in the bottom left, are doing slightly better than chance, predicting that the financial incentive will be more effective than the information intervention 55% of the time.
(06:16) Yeah, question.
MALE SPEAKER: (06:16) [inaudible]
NICHOLAS OTIS: (06:21) The experts are predicting the point estimate of a causal effect. So the experts are asked, what is the causal effect of the financial incentive intervention on vaccination compared to control? They're asked the same thing for the information intervention. Then we're taking the average prediction and we're using that average prediction from a group of experts to create a binary ranking between those two interventions.
(06:44) Yeah.
MALE SPEAKER: (06:44) Sorry. Would you just be able to explain a little bit more about the rationale of bootstrapping? It seems that would just artificially sort of minimize the variance that... You're taking an average of averages essentially.
ROSSA O'KEEFFE-O'DONOVAN: (06:56) Nick, would you repeat the question.
NICHOLAS OTIS: (06:57) Yes. Yes. Absolutely. So the question here is basically I think, is this kind of a mechanical result from taking bootstrapped samples of experts? Is that right? Okay. Great.
(07:09) Yeah. So the performance of crowds of experts is evaluated at the crowd level. Let's say you have a sample of 10 experts. You evaluate whether their predictions rank information as being less effective than financial incentives. As the crowd size increases, what you're going to observe is convergence toward the prediction from the full sample of 52 experts. Now if those experts, on average, predicted that financial incentives would be less effective, then as the crowd size increases you would see this converge toward 0. So the convergence, either toward 100% or toward 0, is a function of the empirical distribution of expert beliefs and not a mechanical result of aggregation.
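Continuing the sketch above under the same assumptions, a quick check of this point: whether the bootstrapped crowd share converges toward 100% or toward 0% depends on the average of the simulated beliefs, not on the aggregation itself.

```python
# Reusing `crowd_accuracy` and `rng` from the earlier sketch: flip the simulated
# means so the average expert now favors information over incentives.
flipped_incentive = rng.normal(1.0, 2.5, size=52)
flipped_info = rng.normal(3.0, 2.5, size=52)
for c in (1, 10, 30):
    # The printed share now falls toward 0% as the crowd size grows.
    print(c, crowd_accuracy(flipped_incentive, flipped_info, c))
```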
(08:10) Okay. So here's the performance now going from crowds of size 1 to crowds of size 30. Individual experts are doing slightly better than chance, but when we take the predictions of experts and we pool them together, we see substantial improvements in the ability of these crowds of experts to rank pairs of policies. Crowds of size 10 are ranking pairs of policies correctly 95% of the time and crowds of size 30 are identifying financial incentives as the more effective intervention nearly 100% of the time. So that's two policies from one randomized experiment.
(08:49) Now I'm going to present the full set of results, which will cover 161 policy comparisons from seven experiments. So this is a much more general set of results. I'm not going to be able to talk through the details of all seven experiments, but I'll highlight a few features of these studies.
(09:07) First, the settings and interventions are very diverse. They range from interventions in Kenya looking at the effects of cash transfers benchmarked against various psychological interventions, to a study in Jordan looking at the effects of soft skills training and wage subsidies on female labor force participation, to the different interventions to increase vaccination in Sweden (we already discussed two of those), to a variety of studies looking at the effects of short nudges. So this is a very diverse set of studies and interventions.
(09:45) Second, each of these studies has a precisely estimated causal effect. So we have this gold standard causal evidence from a randomized controlled trial that we're using as a benchmark to evaluate the performance of crowds of experts.
(10:00) Third, each of these studies has a separate group of experts. So we have seven independent samples of experts. The results are not going to be limited to a particular group of experts that have very accurate beliefs.
(10:15) Now we'll move on to the full set of results. This pools the performance of crowds of experts across all seven studies, giving each of the seven studies equal weight.
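As a purely illustrative sketch of what equal weighting across studies means here, the snippet below averages crowd accuracy within each study first and then averages those study-level means; the study labels and numbers are made up, not the paper's data.

```python
# Hypothetical per-comparison crowd accuracies, keyed by study; in the paper
# there are seven studies and 161 policy comparisons in total.
import numpy as np

accuracy_by_study = {
    "kenya_cash_vs_psychology": [0.91, 0.88],
    "jordan_soft_skills_subsidy": [0.72, 0.80, 0.77],
    "sweden_vaccination": [0.95, 0.83],
    # ... remaining studies would be listed here
}

# Equal weight per study: average within each study first, then average the
# study-level means, so studies with more comparisons do not dominate.
study_means = [np.mean(accs) for accs in accuracy_by_study.values()]
pooled_accuracy = np.mean(study_means)
print(round(float(pooled_accuracy), 3))
```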
(10:28) First, we're going to look again at the performance of individual experts. How well can individual experts predict which of two policies will have a larger causal effect? We can see that individual experts are once again doing slightly better than chance: across the seven experiments, individual experts predict which of two policies will have a larger effect, which policy will be more impactful, 65% of the time. But when we take the predictions of individual experts and aggregate them together, we see large improvements in the ability of these crowds to rank pairs of policies. Crowds of size 10 are 17 percentage points more likely to identify the better performing policy. Crowds of size 30 are identifying the better performing policy 86% of the time.
(11:17) Eva.
EVA: (11:18) How are you accounting for… So if I'm understanding correctly, you've got multiple policies and then you're taking pairs of them every time.
NICHOLAS OTIS: (11:27) Yeah.
EVA: (11:28) How are you accounting for the sort of shared overlap... interventions 1, 2 and 3, comparing them to… There is some correlation there between...
NICHOLAS OTIS: (11:40) Yeah. So in these results, I'm basically ignoring that correlation. Sorry, let me repeat the question: how am I accounting for the fact that, when we're looking at studies with several interventions, there's going to be some correlation in the rankings? If intervention 1 is better than intervention 2 and intervention 1 is better than intervention 3, then there's correlation between those pairwise rankings. And basically, I'm treating these as discrete pairs of policies, so I'm not doing anything to account for that correlation.
(12:10) Okay. So crowds of experts are performing quite well in absolute terms and they're also doing much better than individual experts at ranking pairs of policies.
(12:19) Yeah. Angelo.
ANGELO: (12:19) What are the confidence intervals?
NICHOLAS OTIS: (12:22) Sorry, the question is, what are the confidence intervals? They're standard bootstrapped confidence intervals that I've simulated, and these are 95% and 99% intervals.
(12:39) Okay. So far, I've been saying that policy A is more effective than policy B if policy A has a larger treatment effect than policy B. But in some cases, policy A will be only trivially more effective than policy B. We might be especially interested in comparisons where the magnitude of the difference in effects between policies is large or where the results are less noisy. So in the next set of analyses, I'm going to focus on pairs of policies whose effects are significantly different at the 0.1 level. These are pairs of policies where we can be somewhat more confident that there's a meaningful difference in the effects of the interventions. And when we look at these pairs of interventions, where we're somewhat more confident in the difference in effects, we see even larger wisdom-of-the-crowds effects. Now crowds of size 10 are identifying the better performing policy 88% of the time, and crowds of size 30 are identifying the better performing policy 92% of the time.
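A minimal sketch of how such a restriction might be implemented, assuming each policy pair carries the two RCT point estimates and the standard error of their difference; the field names and numbers below are hypothetical, not taken from the underlying studies.

```python
from dataclasses import dataclass

@dataclass
class PolicyPair:
    effect_a: float  # estimated treatment effect of policy A (percentage points)
    effect_b: float  # estimated treatment effect of policy B (percentage points)
    se_diff: float   # standard error of the difference (effect_a - effect_b)

def significantly_different(pair: PolicyPair, z_crit: float = 1.645) -> bool:
    """Two-sided test at the 0.1 level: keep the pair if |z| > 1.645."""
    z = (pair.effect_a - pair.effect_b) / pair.se_diff
    return abs(z) > z_crit

# Hypothetical pairs: the first difference is clearly significant, the second is not.
pairs = [PolicyPair(4.2, 0.7, 1.1), PolicyPair(1.5, 1.2, 0.9)]
restricted = [p for p in pairs if significantly_different(p)]
print(len(restricted))  # crowd accuracy is then re-computed on this subset
```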
(13:40) Yeah.
MALE SPEAKER: (13:42) So what... For the t-statistic, are the experts predicting whether it would be statistically significant?
NICHOLAS OTIS: (13:55) No. Yeah.
MALE SPEAKER: (13:59) Are you adjusting the t-statistic for, you know, one study has seven pairwise comparisons to...
NICHOLAS OTIS: (14:08) Yeah. The question is, given that there are tons of comparisons here, am I adjusting the t-statistic to account for the fact that there's a lot of inference going on. I'm not adjusting it. And in some sense, that means you can think of these results as a lower bound. There's going to be a lot of sampling variation across these many studies, and some of the rankings are still going to be due to chance alone. So the fact that experts are doing well here should be reassuring.
(14:35) Finally, we're going to split the results up across the seven studies. What I presented before was basically a convex combination of these seven figures, giving each study equal weight. There are two things to note about this figure. First, there's a lot of variation in the initial disagreement among experts. In panel G, which looks at nudges to increase mask wearing in the US, initial disagreement is close to 50%: individual experts are predicting which policy will be more effective only about half the time. Whereas in panel B, looking at the effects of psychotherapy versus cash transfers in rural Kenya, individual experts are doing quite well. But importantly, across all seven studies, there are meaningful improvements in policy choice from aggregating predictions and using these crowd forecasts to rank pairs of policies.
(15:33) The purpose of this paper is to test how well experts can predict which policies will be more effective. I use data from seven large randomized experiments and predictions of the causal effects of these interventions from 863 experts. I showed that crowds of experts do quite well at identifying which policies will have larger causal effects. Crowds of size 30 are identifying the better performing policy 86% of the time, compared to just 65% of the time for individual experts. And these results are even stronger if we restrict ourselves to pairs of policies where we're somewhat more confident in the difference in effects. So in sum, I think these results suggest that predictions from crowds of experts are a valuable tool for selecting policies.
(16:25) Thank you for your attention.