Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn’t happen. A key part of the IPP is using a novel ‘Discounted REward for Same-Length Trajectories (DREST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.

Other working papers

In search of a biological crux for AI consciousness – Bradford Saad (Global Priorities Institute, University of Oxford)

Whether AI systems could be conscious is often thought to turn on whether consciousness is closely linked to biology. The rough thought is that if consciousness is closely linked to biology, then AI consciousness is impossible, and if consciousness is not closely linked to biology, then AI consciousness is possible—or, at any rate, it’s more likely to be possible. A clearer specification of the kind of link between consciousness and biology that is crucial for the possibility of AI consciousness would help organize inquiry into…

Funding public projects: A case for the Nash product rule – Florian Brandl (Stanford University), Felix Brandt (Technische Universität München), Dominik Peters (University of Oxford), Christian Stricker (Technische Universität München) and Warut Suksompong (National University of Singapore)

We study a mechanism design problem where a community of agents wishes to fund public projects via voluntary monetary contributions by the community members. This serves as a model for public expenditure without an exogenously available budget, such as participatory budgeting or voluntary tax programs, as well as donor coordination when interpreting charities as public projects and donations as contributions. Our aim is to identify a mutually beneficial distribution of the individual contributions. …

Tough enough? Robust satisficing as a decision norm for long-term policy analysis – Andreas Mogensen and David Thorstad (Global Priorities Institute, Oxford University)

This paper aims to open a dialogue between philosophers working in decision theory and operations researchers and engineers whose research addresses the topic of decision making under deep uncertainty. Specifically, we assess the recommendation to follow a norm of robust satisficing when making decisions under deep uncertainty in the context of decision analyses that rely on the tools of Robust Decision Making developed by Robert Lempert and colleagues at RAND …