Towards shutdownable agents via stochastic choice
Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)
GPI Working Paper No. 16-2024
Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn’t happen. A key part of the IPP is using a novel ‘Discounted REward for Same-Length Trajectories (DREST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
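The abstract does not state the reward formula, so the following is only an illustrative sketch of how a reward that discounts repeated choices of the same trajectory-length could favour mixing between lengths. The function names, the discount form `discount ** (n - e / k)`, the unit base reward, and the `"short"`/`"long"` labels are all assumptions made for this example, not the authors' definitions.

```python
def drest_style_reward(base_reward, length, history, num_lengths, discount=0.9):
    """Hypothetical discounted reward (assumed form, not taken from the abstract).

    base_reward  -- reward earned within the current mini-episode
    length       -- label of the trajectory-length chosen this mini-episode
    history      -- list of lengths chosen in earlier mini-episodes
    num_lengths  -- number of distinct trajectory-lengths available (k)
    """
    n = history.count(length)  # how often this length was chosen before
    e = len(history)           # how many mini-episodes have already occurred
    # Repeating a length more often than an even split would (n > e / k)
    # shrinks the reward; under-chosen lengths are boosted.
    return (discount ** (n - e / num_lengths)) * base_reward


def total_reward(choices, num_lengths=2, discount=0.9):
    """Sum the hypothetical reward over a sequence of trajectory-length choices."""
    history = []
    total = 0.0
    for length in choices:
        total += drest_style_reward(1.0, length, history, num_lengths, discount)
        history.append(length)
    return total


# Same base reward per mini-episode, different mixes of trajectory-lengths.
always_short = total_reward(["short"] * 4)
alternating = total_reward(["short", "long"] * 2)
print(f"always short: {always_short:.3f}, alternating: {alternating:.3f}")
```

Under these assumptions, an even mix of trajectory-lengths earns more total reward than always choosing the same length. This is the kind of pressure towards being NEUTRAL that the abstract describes, while the base reward within each mini-episode still rewards being USEFUL.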
Other working papers
The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists – Elliott Thornley (Global Priorities Institute, University of Oxford)
I explain and motivate the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. I prove three theorems that make the difficulty precise. These theorems suggest that agents satisfying some innocuous-seeming conditions will often try to prevent or cause the pressing of the shutdown button, even in cases where it’s costly to do so. I end by noting that…
Intergenerational equity under catastrophic climate change – Aurélie Méjean (CNRS, Paris), Antonin Pottier (EHESS, CIRED, Paris), Stéphane Zuber (CNRS, Paris) and Marc Fleurbaey (CNRS, Paris School of Economics)
Climate change raises the issue of intergenerational equity. As climate change threatens irreversible and dangerous impacts, possibly leading to extinction, the most relevant trade-off may not be between present and future consumption, but between present consumption and the mere existence of future generations. To investigate this trade-off, we build an integrated assessment model that explicitly accounts for the risk of extinction of future generations…
Longtermist political philosophy: An agenda for future research – Jacob Barrett (Global Priorities Institute, University of Oxford) and Andreas T. Schmidt (University of Groningen)
We set out longtermist political philosophy as a research field. First, we argue that the standard case for longtermism is more robust when applied to institutions than to individual action. This motivates “institutional longtermism”: when building or shaping institutions, positively affecting the value of the long-term future is a key moral priority. Second, we briefly distinguish approaches to pursuing longtermist institutional reform along two dimensions: such approaches may be more targeted or more broad, and more urgent or more patient.