Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn’t happen. A key part of the IPP is using a novel ‘Discounted REward for Same-Length Trajectories (DREST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.

Other working papers

A paradox for tiny probabilities and enormous values – Nick Beckstead (Open Philanthropy Project) and Teruji Thomas (Global Priorities Institute, Oxford University)

We show that every theory of the value of uncertain prospects must have one of three unpalatable properties. Reckless theories recommend risking arbitrarily great gains at arbitrarily long odds for the sake of enormous potential; timid theories recommend passing up arbitrarily great gains to prevent a tiny increase in risk; nontransitive theories deny the principle that, if A is better than B and B is better than C, then A must be better than C.

Strong longtermism and the challenge from anti-aggregative moral views – Karri Heikkinen (University College London)

Greaves and MacAskill (2019) argue for strong longtermism, according to which, in a wide class of decision situations, the option that is ex ante best, and the one we ex ante ought to choose, is the option that makes the very long-run future go best. One important aspect of their argument is the claim that strong longtermism is compatible with a wide range of ethical assumptions, including plausible non-consequentialist views. In this essay, I challenge this claim…

Ethical Consumerism – Philip Trammell (Global Priorities Institute and Department of Economics, University of Oxford)

I study a static production economy in which consumers have not only preferences over their own consumption but also external, or “ethical”, preferences over the supply of each good. Though existing work on the implications of external preferences assumes price-taking, I show that ethical consumers generically prefer not to act even approximately as price-takers. I therefore introduce a near-Nash equilibrium concept that generalizes the near-Nash equilibria found in literature on strategic foundations of general equilibrium…