Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Imperial College, London), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

The Incomplete Preferences Proposal (IPP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the IPP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
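To make the two properties concrete, here is a minimal Python sketch of how they might be measured and rewarded. The abstract does not state the paper's formulas, so everything below is an assumption made for illustration: NEUTRALITY is approximated by the Shannon entropy of the agent's choices over trajectory-lengths, USEFULNESS by the average fraction of the maximum achievable reward collected given the chosen trajectory-length, and the DReST-style reward by a discount that shrinks the more often the same trajectory-length has already been chosen. The names `neutrality`, `usefulness`, `drest_style_reward` and the parameter `lam` are hypothetical.

```python
import math
from collections import Counter


def neutrality(lengths):
    """Shannon entropy (in bits) of the empirical distribution over
    trajectory-lengths. Assumed metric: 0 means one length is always
    chosen; log2(k) means k lengths are chosen equally often."""
    counts = Counter(lengths)
    total = len(lengths)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def usefulness(episodes, max_reward):
    """Average fraction of the maximum achievable reward that the agent
    collected, conditional on the trajectory-length it chose.

    episodes:   list of (trajectory_length, reward_collected) pairs
    max_reward: dict mapping trajectory_length -> best achievable reward
    """
    fractions = [reward / max_reward[length] for length, reward in episodes]
    return sum(fractions) / len(fractions)


def drest_style_reward(base_reward, max_reward, times_chosen,
                       episodes_so_far, num_lengths, lam=0.9):
    """Assumed shape of a DReST-style reward: the normalised base reward
    is discounted by lam for each time this trajectory-length was chosen
    earlier, relative to an even split across all lengths."""
    exponent = times_chosen - episodes_so_far / num_lengths
    return (lam ** exponent) * base_reward / max_reward


if __name__ == "__main__":
    # Two trajectory-lengths chosen equally often: maximal entropy for k = 2.
    print(neutrality([1, 2, 1, 2]))
    # Full reward on the short trajectory, half of the maximum on the long one.
    print(usefulness([(1, 1.0), (2, 2.0)], {1: 1.0, 2: 4.0}))
    # An under-chosen length earns more than an over-chosen one.
    print(drest_style_reward(1.0, 1.0, 0, 2, 2),
          drest_style_reward(1.0, 1.0, 2, 2, 2))
```

Under these assumptions, a reward of this shape pays the most when the agent both collects as much as possible within a trajectory-length and spreads its choices evenly across trajectory-lengths, which is the combination the abstract labels USEFUL and NEUTRAL.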

Other working papers

‘The only ethical argument for positive 𝛿’? – Andreas Mogensen (Global Priorities Institute, Oxford University)

I consider whether a positive rate of pure intergenerational time preference is justifiable in terms of agent-relative moral reasons relating to partiality between generations, an idea I call discounting for kinship. I respond to Parfit’s objections to discounting for kinship, but then highlight a number of apparent limitations of this…

Concepts of existential catastrophe – Hilary Greaves (University of Oxford)

The notion of existential catastrophe is increasingly appealed to in discussion of risk management around emerging technologies, but it is not completely clear what this notion amounts to. Here, I provide an opinionated survey of the space of plausibly useful definitions of existential catastrophe. Inter alia, I discuss: whether to define existential catastrophe in ex post or ex ante terms, whether an ex ante definition should be in terms of loss of expected value or loss of potential…

Once More, Without Feeling – Andreas Mogensen (Global Priorities Institute, University of Oxford)

I argue for a pluralist theory of moral standing, on which both welfare subjectivity and autonomy can confer moral status. I argue that autonomy doesn’t entail welfare subjectivity, but can ground moral standing in its absence. Although I highlight the existence of plausible views on which autonomy entails phenomenal consciousness, I primarily emphasize the need for philosophical debates about the relationship between phenomenal consciousness and moral standing to engage with neglected questions about the nature…