Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn’t happen. A key part of the IPP is using a novel ‘Discounted REward for Same-Length Trajectories (DREST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.

Other working papers

Desire-Fulfilment and Consciousness – Andreas Mogensen (Global Priorities Institute, University of Oxford)

I show that there are good reasons to think that some individuals without any capacity for consciousness should be counted as welfare subjects, assuming that desire-fulfilment is a welfare good and that any individuals who can accrue welfare goods are welfare subjects. While other philosophers have argued for similar conclusions, I show that they have done so by relying on a simplistic understanding of the desire-fulfilment theory. My argument is intended to be sensitive to the complexities and nuances of contemporary…

The cross-sectional implications of the social discount rate – Maya Eden (Brandeis University)

How should policy discount future returns? The standard approach to this normative question is to ask how much society should care about future generations relative to people alive today. This paper establishes an alternative approach, based on the social desirability of redistributing from the current old to the current young. …

It Only Takes One: The Psychology of Unilateral Decisions – Joshua Lewis (New York University) et al.

Sometimes, one decision can guarantee that a risky event will happen. For instance, it only took one team of researchers to synthesize and publish the horsepox genome, thus imposing its publication even though other researchers might have refrained for biosecurity reasons. We examine cases where everybody who can impose a given event has the same goal but different information about whether the event furthers that goal. …