Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Imperial College, London), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

The Incomplete Preferences Proposal (IPP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the IPP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.

Other working papers

AI alignment vs AI ethical treatment: Ten challenges – Adam Bradley (Lingnan University) and Bradford Saad (Global Priorities Institute, University of Oxford)

A morally acceptable course of AI development should avoid two dangers: creating unaligned AI systems that pose a threat to humanity and mistreating AI systems that merit moral consideration in their own right. This paper argues these two dangers interact and that if we create AI systems that merit moral consideration, simultaneously avoiding both of these dangers would be extremely challenging. While our argument is straightforward and supported by a wide range of pretheoretical moral judgments, it has far-reaching…

The end of economic growth? Unintended consequences of a declining population – Charles I. Jones (Stanford University)

In many models, economic growth is driven by people discovering new ideas. These models typically assume either a constant or growing population. However, in high income countries today, fertility is already below its replacement rate: women are having fewer than two children on average. It is a distinct possibility — highlighted in the recent book, Empty Planet — that global population will decline rather than stabilize in the long run. …

Non-additive axiologies in large worlds – Christian Tarsney and Teruji Thomas (Global Priorities Institute, Oxford University)

Is the overall value of a world just the sum of values contributed by each value-bearing entity in that world? Additively separable axiologies (like total utilitarianism, prioritarianism, and critical level views) say ‘yes’, but non-additive axiologies (like average utilitarianism, rank-discounted utilitarianism, and variable value views) say ‘no’…