Towards shutdownable agents via stochastic choice
Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Imperial College, London), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)
GPI Working Paper No. 16-2024
The Incomplete Preferences Proposal (IPP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the IPP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
Other working papers
It Only Takes One: The Psychology of Unilateral Decisions – Joshua Lewis (New York University) et al.
Sometimes, one decision can guarantee that a risky event will happen. For instance, it only took one team of researchers to synthesize and publish the horsepox genome, thus imposing its publication even though other researchers might have refrained for biosecurity reasons. We examine cases where everybody who can impose a given event has the same goal but different information about whether the event furthers that goal. …
Should longtermists recommend hastening extinction rather than delaying it? – Richard Pettigrew (University of Bristol)
Longtermism is the view that the most urgent global priorities, and those to which we should devote the largest portion of our current resources, are those that focus on ensuring a long future for humanity, and perhaps sentient or intelligent life more generally, and improving the quality of those lives in that long future. The central argument for this conclusion is that, given a fixed amount of are source that we are able to devote to global priorities, the longtermist’s favoured interventions have…
When should an effective altruist donate? – William MacAskill (Global Priorities Institute, Oxford University)
Effective altruism is the use of evidence and careful reasoning to work out how to maximize positive impact on others with a given unit of resources, and the taking of action on that basis. It’s a philosophy and a social movement that is gaining considerable steam in the philanthropic world. For example,…