Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that does not happen. A key part of the IPP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DReST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.

Other working papers

Estimating long-term treatment effects without long-term outcome data – David Rhys Bernard (Paris School of Economics)

Estimating long-term impacts of actions is important in many areas but the key difficulty is that long-term outcomes are only observed with a long delay. One alternative approach is to measure the effect on an intermediate outcome or a statistical surrogate and then use this to estimate the long-term effect. …

Misjudgment Exacerbates Collective Action Problems – Joshua Lewis (New York University) et al.

In collective action problems, suboptimal collective outcomes arise from each individual optimizing their own wellbeing. Past work assumes individuals do this because they care more about themselves than others. Yet, other factors could also contribute. We examine the role of empirical beliefs. Our results suggest people underestimate individual impact on collective problems. When collective action seems worthwhile, individual action often does not, even if the expected ratio of costs to benefits is the same. …

The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists – Elliott Thornley (Global Priorities Institute, University of Oxford)

I explain and motivate the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. I prove three theorems that make the difficulty precise. These theorems suggest that agents satisfying some innocuous-seeming conditions will often try to prevent or cause the pressing of the shutdown button, even in cases where it’s costly to do so. I end by noting that…