Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Imperial College, London), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

The Incomplete Preferences Proposal (IPP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the IPP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.

Other working papers

Desire-Fulfilment and Consciousness – Andreas Mogensen (Global Priorities Institute, University of Oxford)

I show that there are good reasons to think that some individuals without any capacity for consciousness should be counted as welfare subjects, assuming that desire-fulfilment is a welfare good and that any individuals who can accrue welfare goods are welfare subjects. While other philosophers have argued for similar conclusions, I show that they have done so by relying on a simplistic understanding of the desire-fulfilment theory. My argument is intended to be sensitive to the complexities and nuances of contemporary…

A non-identity dilemma for person-affecting views – Elliott Thornley (Global Priorities Institute, University of Oxford)

Person-affecting views in population ethics state that (in cases where all else is equal) we’re permitted but not required to create people who would enjoy good lives. In this paper, I present an argument against every possible variety of person- affecting view. The argument takes the form of a dilemma. Narrow person-affecting views must embrace at least one of three implausible verdicts in a case that I call ‘Expanded Non- Identity.’ Wide person-affecting views run into trouble in a case that I call ‘Two-Shot Non-Identity.’ …

Should longtermists recommend hastening extinction rather than delaying it? – Richard Pettigrew (University of Bristol)

Longtermism is the view that the most urgent global priorities, and those to which we should devote the largest portion of our current resources, are those that focus on ensuring a long future for humanity, and perhaps sentient or intelligent life more generally, and improving the quality of those lives in that long future. The central argument for this conclusion is that, given a fixed amount of are source that we are able to devote to global priorities, the longtermist’s favoured interventions have…