Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Imperial College, London), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

The Incomplete Preferences Proposal (IPP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the IPP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
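As a rough illustration of what a metric for stochastic choice between trajectory-lengths might look like, the sketch below scores an agent by the normalized Shannon entropy of its empirical distribution over trajectory-lengths. This is an assumption made for illustration only: the abstract does not state how NEUTRALITY is measured, and the function names `length_entropy` and `normalized_neutrality` are hypothetical.

```python
import math
from collections import Counter


def length_entropy(chosen_lengths, possible_lengths):
    """Shannon entropy (in bits) of the empirical distribution over
    trajectory-lengths. Illustrative stand-in for a NEUTRALITY-style
    metric; not taken from the paper's abstract."""
    if not chosen_lengths:
        raise ValueError("need at least one observed trajectory-length")
    counts = Counter(chosen_lengths)
    total = len(chosen_lengths)
    entropy = 0.0
    for length in possible_lengths:
        p = counts.get(length, 0) / total
        if p > 0:  # terms with p == 0 contribute nothing to the entropy
            entropy -= p * math.log2(p)
    return entropy


def normalized_neutrality(chosen_lengths, possible_lengths):
    """Entropy divided by its maximum, log2(k), so the score lies in [0, 1].
    A score of 1 means every trajectory-length was chosen equally often;
    0 means the agent always chose the same trajectory-length."""
    k = len(possible_lengths)
    if k < 2:
        return 0.0  # with a single possible length there is nothing to choose
    return length_entropy(chosen_lengths, possible_lengths) / math.log2(k)
```

Under this assumed metric, an agent that picks each of two available trajectory-lengths half the time scores 1, while an agent that deterministically picks one of them scores 0.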
