Benjamin Tereick | Creating short-term proxies for long-term forecasts
ROSSA O'KEEFFE-O'DONOVAN: (00:03) So without further ado, I'm going to hand over to Benjamin Tereick. So Benjamin is one of the postdocs at GPI. He's going to be presenting on Creating Short-term Proxies for Long-term Forecasts. And over to you, Benjamin. Oh, and Benjamin will take questions at the end and clarification questions during the talk. Thanks so much.
BENJAMIN TEREICK: (00:18) Very short clarification questions. Thank you, Rossa. All right. So I'm going to talk about creating short-term proxies for long-term forecasts. And what I want to do is, I want to... Sorry, this is not a good angle... explain why short-term proxies are a promising tool for improving our ability to make very long-term forecasts, mostly because short-term proxies can help us evaluate forecasters, they can incentivize information acquisition, and they can be used to provide feedback to forecasters. And in this talk, I'm going to make an attempt at discussing how proxies should be evaluated, who should create them, and how we should reward the creators in the end. I should say that none of this is a finished research project; this is a collection of different ideas that feed into different projects, both theoretical and applied, and that I would very much like other people to take up work on as well.
(01:17) So to start a bit with the motivation for why we would look at this. Just as a reminder this morning, this is the workshop on Global Priorities research, and if you look at the research agenda of our institute here, we define Global Priorities research as research that responds to the question, "What should we do with a given amount of limited resources if our aim is to do the most good?" And I'm probably not going to make a very wild claim if I say this is a difficult question.
(01:44) So here is a very much not exhaustive list of answers that people have given in reply to the question. So for doing the Global Priorities research, you should maybe try to eradicate preventable diseases. You should prepare for future pandemics. You should mitigate climate change. You should reduce the risks from artificial intelligence, and maybe you want to do that by speeding up technological progress or by slowing down technological progress. You might want to work on reducing nuclear war risks. You might want to work on reducing fertility. You might want to work on increasing fertility to avoid an empty planet situation. So there are very, very different ideas and very different cause areas, and evaluating these options seems to be really, really hard and to depend on very difficult sub-questions.
(02:26) So just to further illustrate, for instance, if we want to evaluate working on climate change, we might ask, how much will CO2 increase if the world stays on its current consumption path? How would that then exactly affect the climate, say, at the end of the century? How will humans adapt to that changing climate? And maybe how fast will mitigating technologies emerge? We can ask similar questions for AI risk. How long will computing power grow exponentially? Once artificial intelligence gets to a point where we are at human-level capacities, how fast are we going to get to superhuman capacities? Very fast, very slow, maybe never. And also, how will all this be affected by more economic considerations, so by the competition of tech firms with each other? So these seem like very difficult questions, and we haven't even touched on any normative questions yet, which I'm also not going to do throughout this talk.
(03:20) However, at least structurally, the situation that we are facing, where we're trying to think of what the most effective interventions are, doesn't seem that much different from the situation faced by some other actors with slightly less ambitious goals. So you could think about governments trying to determine public policy and evaluating various security risks or foreign elections or things like this. So in these situations there are very important sub-questions that seem hard to determine empirically. And there seems to be hope. So in recent years, we have made some progress on informing these types of decisions. And even though this is all very, very messy, we might at least improve our epistemic baseline. And a promising tool for this, I believe, is forecasting, and an example of that happening were the IARPA tournaments that I think many in the room here have heard about or even participated in. So these are regular forecasting tournaments. They have been running since 2010 with questions on economics, foreign policy, elections, and so on. And we could see in these tournaments that by using training and clever aggregation of individual judgments, we can actually improve quite a bit on the epistemic baseline of subject matter experts. And in fact, many of you will know that the winning team for the first two editions of this was the Good Judgment Project led by Philip Tetlock, and most members of this team were not subject matter experts, at least for many of the questions that were being asked.
(04:58) So what's the structural similarity to the type of questions we want to answer when we are looking at the type of interventions I had two slides ago? Well, plausibly, the relevant expertise for addressing these questions is widely dispersed among humanity. In its current form it is typically not aggregated yet, at least not before we do so, and it also doesn't really exist in any probabilistic format; it lives in some tacit knowledge out there. So if this has a similar structure, then maybe an IARPA-like forecasting competition could help for Global Priorities research as well.
(05:30) Okay. And here, I'm not claiming I'm the first one to point this out, far from it. Many people are optimistic about these possibilities, and leading forecasting organizations are already working on this. Here is Metaculus having a question on global warming by the end of the century. Here is an online prediction market on China's nominal GDP before 2014. It's really not in the very far future, but still reasonably far. And Metaculus again, on the date of the arrival of Artificial General Intelligence. So this is an idea that is very much out there and I'm not claiming originality in any way.
(06:11) But there is some difference, at least if we buy into weak versions of longtermism, where we at least think that, among the plausible candidates for the most effective interventions, there are interventions aimed at improving the very long term. So then we want to make forecasts about the very long term, and there we do face additional structural difficulties compared to the already difficult situation of the government agency trying to determine foreign policy. And that is because it's sort of impossible, or at least unattractive, to wait until the resolution of the question. And it's useful to have a resolution of forecasting questions because then we can use it to incentivize forecasters: the more accurate the forecast turns out to be after the resolution event, the higher the payment I can give to the forecaster. I can use the resolution to identify experts, so who's the most accurate, and I can also give feedback to the forecasters so that they can learn from previous forecasts. And a possible solution that I want to talk about for the rest of the talk is short-term proxies. So here, the idea is basically to turn long-term forecasting back into short-term forecasting again and get back most of these advantages.
(07:20) So what's a short-term proxy? Suppose we want to forecast a long-term event L. A short-term proxy S is a question that resolves earlier and is related to L. So as an example, suppose that L is the question, "Will an AI have superhuman abilities on a broad range of cognitive tasks 50 years from today?" A short-term proxy could be, "Will an AI win an International Mathematical Olympiad gold medal in 5 years from today?" Now, a perfect short-term proxy, this may be obvious to point out but I think it's useful to remember when you're trying to evaluate proxies, would fully resolve the question. So, for instance, if it was the case that an AI would win a gold medal by 2027 if and only if AI would also get to superhuman capabilities by 2072, then we would get all the advantages of short-term forecasting by just asking the proxy question. We would fully resolve the uncertainty about the long-term event by the short-term proxy. So that's sort of the ideal scenario.
(08:23) Obviously, typically, we won't get there. So typically, short-term proxies will not be perfect. And then we can evaluate the ones that we do get with something akin to existing signal detection theory. So to illustrate this a bit, let's just make all of our events binary for simplicity, and let's say L = 1 means the long-term event in question occurs and S = 1 means the short-term event that is proposed as a proxy question occurs. Then we can look at these conditional probabilities: the probability of the short-term event resolving positively given that the long-term event will resolve positively, that will be the true positive rate. People might also know this as the sensitivity of a statistical test, or they have heard of it as a hit rate. So there's probably some version of this in your statistics courses. And relatedly, we can call the probability that the short-term proxy doesn't occur given that the long-term event doesn't occur the true negative rate or specificity. And one way to say what a perfect proxy would be is that both of these probabilities are one. And typically, they won't be both one, so there will be trade-offs between questions that have high sensitivity and high specificity. And typically, you can even make one of them perfect, basically artificially. Just as an example, instead of asking the question about the gold medal, I could ask, "Will an AI have superhuman abilities on a broad range of cognitive tasks 5 years from today?" And that will have perfect specificity. So it will never resolve positively if the 2072 event doesn't. But it will probably not be extremely useful.
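Just to make these two numbers concrete, here is a minimal sketch in Python, under a purely invented joint distribution over a binary proxy S and a binary long-term event L; the sensitivity and specificity then come out as the two conditional probabilities just described.

```python
# Minimal sketch: sensitivity and specificity of a binary short-term proxy S
# for a binary long-term event L, given a hypothetical joint distribution.

# Joint probabilities P(S=s, L=l); the numbers are purely illustrative.
joint = {
    (1, 1): 0.08,  # proxy occurs and long-term event occurs
    (0, 1): 0.02,  # proxy misses the long-term event
    (1, 0): 0.15,  # false alarm
    (0, 0): 0.75,  # neither occurs
}

p_L1 = joint[(1, 1)] + joint[(0, 1)]      # P(L=1)
p_L0 = joint[(1, 0)] + joint[(0, 0)]      # P(L=0)

sensitivity = joint[(1, 1)] / p_L1        # P(S=1 | L=1), true positive rate
specificity = joint[(0, 0)] / p_L0        # P(S=0 | L=0), true negative rate

print(f"sensitivity = {sensitivity:.2f}")  # 0.80 with these numbers
print(f"specificity = {specificity:.2f}")  # roughly 0.83 with these numbers
```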
(10:00) We can also look at this graphically. So this is the true positive rate and the true negative rate, and again, the perfect proxy question would be just at the top right of this graph. So it would always say it's going to occur if it does occur, and always say it's not going to occur if it doesn't occur. And at the bottom right and top left, you will have these sort of artificial proxies, where you can just make them have perfect specificity or perfect sensitivity, but probably lose too much in the process. And then, if you're an economist, you could think about this as a budget set of possible questions that you could ask. And there will be some limits to it if we assume we can't ask perfect proxy questions. And then you can think of this blue line as the efficient frontier. So this is the maximum specificity you can get without trading off any further sensitivity, and then you can evaluate questions on that frontier.
(11:00) But that, of course, raises the question of who is going to come up with these questions? And who is going to tell us what the probabilities are? And what I'm proposing here, and again, this has been proposed elsewhere and actually put into practice, but I'm going to give some new reasons for it, is that it seems advisable to have a division of labor between forecasters and domain experts, where the domain experts come up with the questions. And then, once they have come up with a proxy question, we are going to ask forecasters to provide probabilities, both for the probability that the short-term proxy occurs and for the conditional probabilities that the long-term event occurs given that the short-term event does or does not occur. And just to be careful here, these are not the specificity and sensitivity; those are the conditional probabilities the other way around. But once we get all three of these, we get the full joint distribution of S and L, and then we can calculate the specificity and sensitivity. And then we can use these in various ways, whatever our preferred theory is, to evaluate the usefulness of the short-term proxy. And this usefulness can in turn be used to reward the question creators. Just as a note, the general structure of what I'm saying here is also what Karger, Atanasov and Tetlock argued for in a working paper this year.
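As a hedged sketch of the step just described, assuming three illustrative forecaster inputs P(S), P(L|S=1) and P(L|S=0), this is how one could reconstruct the joint distribution and read off sensitivity and specificity; none of the numbers come from the talk.

```python
# From three forecaster inputs to the full joint over (S, L), and from there to
# sensitivity and specificity via Bayes' rule. All numbers are illustrative.

p_S = 0.20           # P(S=1): forecast that the proxy event occurs
p_L_given_S1 = 0.40  # P(L=1 | S=1)
p_L_given_S0 = 0.03  # P(L=1 | S=0)

# Full joint distribution over (S, L)
joint = {
    (1, 1): p_S * p_L_given_S1,
    (1, 0): p_S * (1 - p_L_given_S1),
    (0, 1): (1 - p_S) * p_L_given_S0,
    (0, 0): (1 - p_S) * (1 - p_L_given_S0),
}

p_L1 = joint[(1, 1)] + joint[(0, 1)]      # P(L=1)

# The proxy-quality numbers are the conditional probabilities the other way around:
sensitivity = joint[(1, 1)] / p_L1        # P(S=1 | L=1)
specificity = joint[(0, 0)] / (1 - p_L1)  # P(S=0 | L=0)

print(round(sensitivity, 3), round(specificity, 3))  # roughly 0.769 and 0.866
```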
(12:19) So why would we have this division of labor? I think there's a nice way to think about domain expertise that doesn't necessarily relate to superior forecasting skill. And to see that, let's think about the set of possible states that could resolve by 2027, and we're going to label this Ω. And of course, this is a vast space of states with a huge amount of detail. So if we're thinking about the AI context, this will include things like the CEO of a leading AI technology firm who is going to have a shower in the morning and cereal for breakfast, and who is going to a press conference later this afternoon where they're going to announce new developments in technology. But it will also include all kinds of irrelevant things, like what color of socks I am wearing on that day, how many bacteria are there in the room, if bedrooms still exist by that day, what's the body temperature of a capybara in the Andes on that very day, and so on. So that's the kind of detail that can be omitted, and also detail that is essentially unknowable for any forecaster today. So no one has access to the true Ω. But it might be helpful to think about rankings of experts in terms of how fine their partition of Ω is. So let's say a domain expert, or any expert really, is somebody who has a set representing these states, but these are all lumped into sets of sets, and these form a partition.
(13:51) And domain expertise then just translates to having finer states. So a maximally uninformed expert in terms of domain expertise would just say, some state is going to resolve, and that's all. So their full partition is just: whatever is going to happen will be in the set of possible states. And this is our very uninformed Forecaster #1. Now we're going to construct a new forecaster, #2. Let's say O is the set of all states in which an AI wins a gold medal by 2027. And we have a Forecaster #2 that can distinguish between these two possibilities. They can say whatever is going to happen in 2027, either an AI will win the gold medal or not. So that's the space of possibilities they can think of by 2027. There could be a more informed forecaster that also sees, oh, there's also the possibility that the United States has established a government department for AI alignment by that time, and then they can distinguish between the different realizations of all of these possibilities. And obviously, true humans will have partitions that are much larger than this one, but just to illustrate the ranking you would get here: Forecaster #3 has a finer partition than Forecaster #2, who has in turn a finer partition than Forecaster #1. And we would say that Forecaster #3 is the most informed.
(15:07) Yes.
HILARY GREAVES: (15:07) This is really a clarification. For an expert to have a given partition, is it that they're thinking these are the available alternatives? Or is it that they know which element obtains...
BENJAMIN TEREICK: (15:16) Yeah. Yeah. That's very important to clarify. So it's the first.
ROSSA O'KEEFFE-O'DONOVAN: (15:19) Will you just repeat the question?
BENJAMIN TEREICK: (15:22) Sure. So the question was, do these partitions represent the possibilities that the forecasters are thinking of, or the ones that they can distinguish in terms of which one is true? And it's very much the first. So the second one is a very classic idea to capture the power of information sources. And here I'm turning this around a bit and saying, okay, but we can do the same thing for thinking about the possibilities of states.
(15:54) And then, just to point out that a binary question itself is also a partition of the state space. So the question, will an AI win an IMO Gold Medal by 2027, and will the US government have established a department on AI alignment, can be represented by this partition. And then why does domain expertise in the sense that I have defined it matter? Well, the Expert #2 and Expert #1 that I defined earlier, they couldn't even ask this question. And similarly, if somebody comes up with a very technical indicator of AI development by 2027, I probably wouldn't even be able to think of that question. This is how I want to think about this.
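To illustrate the partition idea in code, here is a small sketch (not from the talk's slides) with a toy four-state space for the 2027 example; the helper refines is just a subset check and the state labels are made up.

```python
# Illustrative sketch: the forecasters' partitions of a toy state space and a
# check of which partition is finer. The four states encode the answers to
# (gold medal by 2027?, US alignment department by 2027?).

states = {"GM&Dept", "GM&noDept", "noGM&Dept", "noGM&noDept"}

forecaster_1 = [states]               # maximally uninformed: one cell with everything
forecaster_2 = [                      # can only distinguish "gold medal" from "no gold medal"
    {"GM&Dept", "GM&noDept"},
    {"noGM&Dept", "noGM&noDept"},
]
forecaster_3 = [{s} for s in states]  # distinguishes all four possibilities

def refines(fine, coarse):
    """True if every cell of `fine` sits inside some cell of `coarse`."""
    return all(any(cell <= big for big in coarse) for cell in fine)

print(refines(forecaster_3, forecaster_2))  # True: #3 is more domain-informed
print(refines(forecaster_2, forecaster_1))  # True
print(refines(forecaster_2, forecaster_3))  # False
```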
(16:31) This can be unrelated to forecasting expertise, and maybe the right way to think about forecasting expertise is that once we are handed a question, or we are handed a state space, somebody who is a very good forecaster will have a subjective probability measure on that space that is in some sense just a good one. So one way to think about how it could be a good one is to say there is some true probability measure, which is obviously philosophically controversial, but if you accept that for a moment, then you could say: if your personal probability measure on that space is a better approximation of the true one, then you're a better forecaster. And what would that mean in practice? So in the IARPA tournaments that I mentioned earlier, you could think of the best performers, who were called superforecasters, as the ones who had the better approximation of the true probability measure on these intelligence-relevant events, as measured by the Brier Score, because that was the reward structure that was used, although probably it would have been true for other scores as well. All right. So much for the division of labor. Now, if we want to evaluate these short-term proxies in the end, how are we going to get these probabilities?
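Since the Brier Score just came up as the reward structure used in the IARPA tournaments, here is a minimal sketch of it; the forecasts and outcomes below are invented.

```python
# Minimal Brier score sketch: mean squared error between probabilistic
# forecasts and binary outcomes (lower is better).

def brier_score(forecasts, outcomes):
    """Average squared distance between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Two hypothetical forecasters on the same three resolved binary questions:
outcomes = [1, 0, 1]
print(brier_score([0.9, 0.2, 0.7], outcomes))  # ~0.047: the better approximation
print(brier_score([0.6, 0.5, 0.5], outcomes))  # 0.22: the worse approximation
```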
(17:52) We said, okay, we're going to ask a bunch of forecasters, and they might be good in the sense that they have very precise personal probability functions once we hand the state space to them. But okay, why would they care? Why would they go to the effort of quantifying their subjective probabilities, maybe reading up on the subject, maybe doing something even more expensive to acquire information? So probably, we want to reward them. But we are back to this original problem of the difficulty of rewarding forecasts for long-term events, because we cannot just wait for the resolution. And that isn't changed by having our conditional forecasts, because these are still about the long-term event L. And I'm not going to go too deeply into this, because this would be the content of another talk altogether. But there is literature addressing this question, or maybe several literatures. One literature addressing it is the literature on truth serums, started by Drazen Prelec in 2004 with the Bayesian truth serum. And in that literature, for instance, a scheme proposed by Radanovic and Faltings in 2014, the divergence-based Bayesian truth serum, can be applied to this context. And just to give you a rough outline of how this would work...
(19:06) So their setting is not with a proxy question or conditional forecast, but it can be adapted to this, and then it would work as follows. You ask each forecaster to forecast an unconditional probability for S, the short-term event (that's the one that we can incentivize the classical way), and the probability of L conditional on S. So we're also going to ask them about these conditional probabilities that we're ultimately interested in. Now again, you can reward the forecast for the unconditional short-term event just the normal way: they get an accuracy score. It could be the Brier Score or something else. And now we're going to draw two other forecasters at random and check what they said. If one of these two forecasters was closer in their forecast to the original forecaster's unconditional forecast of S, we are going to select the one for which this is true. We can measure closeness in various ways, but if you think about these binary events, then the probabilities are just numbers, and closer could mean one number is closer to the other in absolute terms. So we are identifying one of the two, and now we are going to look at this person's conditional forecast. Is it also closer to the original forecaster's than the other one's? If the answer is yes, nothing happens and they just all receive their original accuracy score. But if the answer is no, they get a punishment. So this is one way you can incentivize these unverifiable events here, the conditional forecast of L on S.
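Here is a rough, hedged sketch of that peer-comparison step as described verbally above, not the original Radanovic and Faltings formulas; the report values, the penalty size, and the function name peer_penalty are all assumptions for illustration.

```python
import random

# Each report is (P(S), P(L | S)). Penalize forecaster i when the peer who agrees
# with them on the unconditional forecast does NOT also agree on the conditional one.

def peer_penalty(i, reports, penalty=0.1, rng=random):
    """Return the penalty (0 or `penalty`) for forecaster i under the scheme sketched above."""
    p_s_i, p_l_given_s_i = reports[i]
    others = [j for j in range(len(reports)) if j != i]
    j, k = rng.sample(others, 2)  # draw two peers at random

    # Which peer is closer to i on the unconditional forecast of S?
    if abs(reports[j][0] - p_s_i) <= abs(reports[k][0] - p_s_i):
        closer, other = j, k
    else:
        closer, other = k, j

    # If that peer is NOT also closer on the conditional forecast, punish i.
    d_closer = abs(reports[closer][1] - p_l_given_s_i)
    d_other = abs(reports[other][1] - p_l_given_s_i)
    return 0.0 if d_closer <= d_other else penalty

# Reports from four hypothetical forecasters: (P(S), P(L | S))
reports = [(0.20, 0.40), (0.25, 0.45), (0.60, 0.10), (0.22, 0.42)]
rng = random.Random(0)
print([peer_penalty(i, reports, rng=rng) for i in range(len(reports))])
```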
(20:37) Do I have... Yeah, I think I do have a couple.
ROSSA O'KEEFFE-O'DONOVAN: (20:39) You've got about 20 minutes.
BENJAMIN TEREICK: (20:44) Right. Yeah. I said I can't really summarize the entire truth serum literature here, but just to give you some idea of the required assumptions on the information structure that you need for the scheme that I just told you about to incentivize learning about L and S, and also revealing what you learned, you would have something like this.
(21:04) So, a forecaster has some private information: there is the stuff they know about S and L, there's the short-term event S, and there's the long-term event L. And the idea would be that there are some deep facts about the world underlying everything, which affect the short-term event, affect the long-term event, and also affect what you privately know. So that's the ω here. For instance, in the AI case, maybe that's something like the physical limits on the possible improvement of computation. You also have a bunch of random things, which are labeled here as ε and δ. So these are contingent events affecting the outcomes in 2027 and 2072. This could, for instance, be the office politics at the leading tech firms in the AI sector, both in 2027 and maybe in 2072. It's also possible regulation that has happened between now and then. But it's not really related to the deep facts, in some sense, about the world. And I hope there are not too many philosophers to press me on this. But there's probably some way for this to be intuitive. If you have a structure like this, this will be incentive compatible. You can also relax this a little bit, because plausibly, we can also learn today a little bit about ε and δ. So you could draw an arrow from ε to private information and from δ. But then you get a condition that says something like: you get to learn more about ω. So ω affects your beliefs more than ε and δ.
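A toy simulation of this assumed information structure, with entirely invented parameters: a deep fact ω drives the short-term event S, the long-term event L, and the forecaster's private signal, while ε and δ are independent contingencies.

```python
import random

# Toy generative model of the assumed structure: ω affects S (2027), L (2072),
# and the private signal; ε and δ only affect S and L respectively.

def draw_world(rng):
    omega = rng.random() < 0.3              # deep fact about the world
    eps, delta = rng.random(), rng.random() # independent contingencies
    signal = omega if rng.random() < 0.8 else not omega  # noisy view of ω only
    s = omega and eps < 0.7                 # short-term event, blocked by ε sometimes
    l = omega and delta < 0.6               # long-term event, blocked by δ sometimes
    return signal, s, l

rng = random.Random(0)
worlds = [draw_world(rng) for _ in range(100_000)]
p_l_sig1 = sum(l for sig, s, l in worlds if sig) / sum(1 for sig, s, l in worlds if sig)
p_l_sig0 = sum(l for sig, s, l in worlds if not sig) / sum(1 for sig, s, l in worlds if not sig)
print(round(p_l_sig1, 2), round(p_l_sig0, 2))  # the private signal is informative about L
```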
(22:40) So we've established that it may be a nice way to translate long-term forecasts into short-term forecasts. We could do this by short-term proxies. These should be generated by domain experts. We can evaluate these questions, or we can get relevant inputs for evaluating these questions, via forecasters that we can incentivize via truth serums. But then how are we going to evaluate them exactly? So we have this efficient frontier from the signal detection literature. But then what do we do? Which element on the frontier do we pick? And I think it depends. So I think there are various things you could score for and then use as an incentive structure for question generators. So one idea is just the classical learning idea: you want to determine short-term events that in some sense are very informative about the long-term event, so that by knowing how they resolve, we reduce the uncertainty about the long-term future. A different motivation might be that we want to identify cruxes. So we want to know which events would cause the disagreement about the long-term event to reduce. So maybe, for instance, if an AI would win an IMO Gold medal next year, maybe many people would get very, very alarmed and there would be less disagreement about the threat of AI. But maybe if there aren't any further improvements on the current state of the art until the year 2027, maybe everybody would agree, okay, maybe we freaked out a bit too much at the beginning of the decade. And this can be useful in its own right, because maybe this helps us coordinate on which action to take. We might also be interested in finding questions that tell us a lot about who knows something about L or not, and we might be able to choose short-term proxies in such a way that we're maximizing the incentives for forecasters to invest into learning about it well. So this is the point where I could have a lot of impressive theoretical results, which I don't have yet. But there are a few thoughts.
(25:00) I think it's quite important to understand the relationship between these and how they relate to question creation. And I think that's relatively easy to do for the first two on this list. So for classical learning, what we probably want to do is aggregate the probabilities that people give us. So again, if you think of a binary question, then each probability is just a number, and an aggregation could, for instance, just be the arithmetic mean of those probabilities. It can also be something a lot more fancy, but that's one way to think about it. And then from this, we can calculate the value of information once we have all the different forecasts. So one possibility, out of a vast space of possible ways to score a question: let's say we are interested in how much S tells us about accurately predicting L, and maybe we want to score this with the logarithmic scoring rule. Then a very standard way of thinking about the value of information would be this term. You think about all the possible outcomes that the short-term event can have (in our case, we said it's binary, so just 0 and 1), then what probability we would assign to the long-term event after the resolution of S, and then what our expected score will be, and we sum that over all the possible outcomes of L. It's again only 0 and 1, and we get the log of the probability that we assigned to the true outcome. That's what the log scoring rule does. And we can compute all this from the inputs of the forecasters. So the important point here is, once I ask somebody to generate a question and ask a bunch of people to provide probabilities on the question, I can immediately pay the person for the question generation. I don't even need to wait for the resolution of S.
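A hedged sketch of that value-of-information term under the log scoring rule, reusing the invented aggregated forecasts from earlier: the expected posterior log score for L after S resolves, minus the prior expected log score (with the log rule, this is just the mutual information between S and L).

```python
from math import log

# Value of information of the proxy S about L under the log scoring rule.
# All probabilities below stand in for aggregated forecasts and are made up.

p_S = 0.20                       # aggregated P(S=1)
p_L_given = {1: 0.40, 0: 0.03}   # aggregated P(L=1 | S=s)
p_L = p_S * p_L_given[1] + (1 - p_S) * p_L_given[0]  # prior P(L=1)

def expected_log_score(p):
    """Expected log score for a binary event forecast with probability p."""
    return p * log(p) + (1 - p) * log(1 - p)

prior_score = expected_log_score(p_L)
posterior_score = (p_S * expected_log_score(p_L_given[1])
                   + (1 - p_S) * expected_log_score(p_L_given[0]))

value_of_information = posterior_score - prior_score  # >= 0; payable immediately
print(round(value_of_information, 4))
```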
(26:45) Similarly, with disagreement reduction, I can measure the divergence of forecasters' conditional probabilities. And again, divergence can be a very general construct, but if you just think about these probabilities as numbers, divergence can, for instance, be just the variance of all the probabilities that people give.
[alarm]
(27:11) That's the one we're supposed to ignore. Right? Will there be any other ones or...
ROSSA O'KEEFFE-O'DONOVAN: (27:20) I think that's going to be it. I don't think there's going to be more. There's nothing to worry about.
BENJAMIN TEREICK: (27:29) There must be an early alarm low specificity joke here but I'm not quick enough.
(27:38) All right.
[alarm]
(27:44) Up to four. All right.
(27:48) So, if you haven't seen this before, you can think about these divergences as, for instance, the variance of the numbers that people give for their conditional probabilities, and you compare this to the initial disagreement you had about the long-term event, not conditioning on the short-term event. Now we can do two things. We can do a similar thing as we had for the value of information. So we first aggregate all the probabilities, and then we can calculate an expected disagreement reduction. And again, this will just spit out a number once we have all the forecasts, and then we can use that as the payment for the person who generated the question. A nice thing, maybe, about using disagreement reduction rather than value of information, however, would be that maybe we don't need to commit to a specific way to aggregate for payments, and we can just wait until S realizes and then pay the actual disagreement reduction in terms of the probabilities that people pre-committed to. So these you already know before anything happens, just after you have the question and the forecasts; I then just wait to see how S resolves to pick my reward. So this is maybe a sort of agnostic approach where you don't say, I'm going to aggregate probabilities in a certain way; you just use the inputs of the forecasters and then see how the disagreement reduces. And that is the value of the question. Disadvantage: I need to wait until S resolves.
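A hedged sketch of the disagreement-reduction score in both versions just described, the ex ante expected one and the agnostic wait-for-S one; all the reported probabilities are invented.

```python
from statistics import pvariance

# Disagreement is measured here as the variance of forecasters' probabilities for L;
# the question's score is how much that variance shrinks once S is known.

p_L_uncond = [0.10, 0.30, 0.50, 0.05]    # initial disagreement about L
p_L_given_S1 = [0.45, 0.50, 0.55, 0.40]  # conditional on the proxy occurring
p_L_given_S0 = [0.02, 0.05, 0.10, 0.01]  # conditional on it not occurring
p_S = 0.20                               # aggregated P(S=1), e.g. the mean forecast

initial = pvariance(p_L_uncond)

# Ex ante version: expected remaining disagreement, usable for immediate payment.
expected_remaining = p_S * pvariance(p_L_given_S1) + (1 - p_S) * pvariance(p_L_given_S0)
print(round(initial - expected_remaining, 4))  # expected disagreement reduction

# "Agnostic" version: wait until S resolves (say S = 0) and pay the realised reduction.
print(round(initial - pvariance(p_L_given_S0), 4))
```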
(29:17) And just in terms of next steps: I had the conjecture that this would actually be equivalent to the value of information, at least under some circumstances. It turns out it isn't. I think it would be really valuable to characterize what the exact relation in fact is. Do they map into each other in a certain sense? Are there specific conditions where they actually are equivalent? And I think another important question is to go to the other motivations. So if we think about questions as providing incentives for investing into knowing stuff about L and S, then is it maybe the case that questions that have a higher value of information also give larger incentives for information seeking? Something like this seems quite intuitive, but it would be nice to know precisely. So this is pretty much where I am. Just to summarize: I tried to justify the use of proxies for long-term questions. I proposed to use truth serums for incentivizing the conditional forecasts that we cannot otherwise incentivize and illustrated the required information structure, and I gave a story about the different kinds of expertise of domain experts and forecasters and used that as a motivation for this division of labor between question creation and forecasting.
(30:38) All right. So this is what I have to say, and I look forward to your questions. Thanks.