Towards shutdownable agents via stochastic choice
Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)
GPI Working Paper No. 16-2024
Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that does not happen. A key part of the IPP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DReST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
Other working papers
Dispelling the Anthropic Shadow – Teruji Thomas (Global Priorities Institute, University of Oxford)
There are some possible events that we could not possibly discover in our past. We could not discover an omnicidal catastrophe, an event so destructive that it permanently wiped out life on Earth. Had such a catastrophe occurred, we wouldn’t be here to find out. This space of unobservable histories has been called the anthropic shadow. Several authors claim that the anthropic shadow leads to an ‘observation selection bias’, analogous to survivorship bias, when we use the historical record to estimate catastrophic risks. …
Doomsday and objective chance – Teruji Thomas (Global Priorities Institute, Oxford University)
Lewis’s Principal Principle says that one should usually align one’s credences with the known chances. In this paper I develop a version of the Principal Principle that deals well with some exceptional cases related to the distinction between metaphysical and epistemic modality. I explain how this principle gives a unified account of the Sleeping Beauty problem and chance-based principles of anthropic reasoning…
Intergenerational equity under catastrophic climate change – Aurélie Méjean (CNRS, Paris), Antonin Pottier (EHESS, CIRED, Paris), Stéphane Zuber (CNRS, Paris) and Marc Fleurbaey (CNRS, Paris School of Economics)
Climate change raises the issue of intergenerational equity. As climate change threatens irreversible and dangerous impacts, possibly leading to extinction, the most relevant trade-off may not be between present and future consumption, but between present consumption and the mere existence of future generations. To investigate this trade-off, we build an integrated assessment model that explicitly accounts for the risk of extinction of future generations…