Imperfect Recall and AI Delegation
Eric Olav Chen (Global Priorities Institute, University of Oxford), Alexis Ghersengorin (Global Priorities Institute, University of Oxford) and Sami Petersen (Department of Economics, University of Oxford)
GPI Working Paper No. 30-2024
A principal wants to deploy an artificial intelligence (AI) system to perform some task. But the AI may be misaligned and pursue a conflicting objective. The principal cannot restrict its options or deliver punishments. Instead, the principal can (i) simulate the task in a testing environment and (ii) impose imperfect recall on the AI, obscuring whether the task being performed is real or part of a test. By committing to a testing mechanism, the principal can screen the misaligned AI during testing and discipline its behaviour in deployment. Increasing the number of tests allows the principal to screen or discipline arbitrarily well. The screening effect is preserved even if the principal cannot commit or if the agent observes information partially revealing the nature of the task. Without commitment, imperfect recall is necessary for testing to be helpful.
Other working papers
‘The only ethical argument for positive 𝛿’? – Andreas Mogensen (Global Priorities Institute, Oxford University)
I consider whether a positive rate of pure intergenerational time preference is justifiable in terms of agent-relative moral reasons relating to partiality between generations, an idea I call discounting for kinship. I respond to Parfit’s objections to discounting for kinship, but then highlight a number of apparent limitations of this…
Three mistakes in the moral mathematics of existential risk – David Thorstad (Global Priorities Institute, University of Oxford)
Longtermists have recently argued that it is overwhelmingly important to do what we can to mitigate existential risks to humanity. I consider three mistakes that are often made in calculating the value of existential risk mitigation: focusing on cumulative risk rather than period risk; ignoring background risk; and neglecting population dynamics. I show how correcting these mistakes pushes the value of existential risk mitigation substantially below leading estimates, potentially low enough to…
Estimating long-term treatment effects without long-term outcome data – David Rhys Bernard (Paris School of Economics)
Estimating long-term impacts of actions is important in many areas but the key difficulty is that long-term outcomes are only observed with a long delay. One alternative approach is to measure the effect on an intermediate outcome or a statistical surrogate and then use this to estimate the long-term effect. …