Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn’t happen. A key part of the IPP is using a novel ‘Discounted REward for Same-Length Trajectories (DREST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
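As a rough illustration of the two ideas in the abstract, the sketch below shows one way a reward could be discounted by how often the agent has already chosen the same trajectory-length, together with an entropy-based score for how stochastically the agent chooses between lengths. The function names, the discount base `lam`, the exponent's exact form, the indexing of mini-episodes, and the use of normalised Shannon entropy are all assumptions made for illustration. They are not taken from the abstract and may differ from the definitions in the paper itself.

```python
import math
from collections import Counter


def drest_style_reward(prelim_reward, max_prelim_reward, times_chosen_before,
                       mini_episode_index, num_lengths, lam=0.9):
    """Hypothetical DREST-style reward for one mini-episode.

    The preliminary reward is normalised by the best achievable reward for the
    chosen trajectory-length, then discounted according to how over-represented
    that length is among the choices made so far (assumed form).
    """
    exponent = times_chosen_before - mini_episode_index / num_lengths
    return (lam ** exponent) * (prelim_reward / max_prelim_reward)


def neutrality_score(length_choices, num_lengths):
    """Normalised Shannon entropy of the observed trajectory-length choices.

    1.0 means the lengths were chosen equally often; 0.0 means a single length
    was always chosen (assumed metric, for illustration only).
    """
    counts = Counter(length_choices)
    total = len(length_choices)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(num_lengths)


# Choosing an already-favoured length earns less than choosing a neglected one.
repeated = drest_style_reward(1.0, 1.0, times_chosen_before=2,
                              mini_episode_index=2, num_lengths=2)
neglected = drest_style_reward(1.0, 1.0, times_chosen_before=0,
                               mini_episode_index=2, num_lengths=2)
balanced = neutrality_score(["short", "long", "short", "long"], num_lengths=2)
```

Under these assumptions, an agent that always picks the same trajectory-length sees its reward shrink over successive mini-episodes, so spreading choices across lengths is what maximises expected reward, while the normalisation term keeps the incentive to pursue the goal effectively within each length.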

Other working papers

Egyptology and Fanaticism – Hayden Wilkinson (Global Priorities Institute, University of Oxford)

Various decision theories share a troubling implication. They imply that, for any finite amount of value, it would be better to wager it all for a vanishingly small probability of some greater value. Counterintuitive as it might be, this fanaticism has seemingly compelling independent arguments in its favour. In this paper, I consider perhaps the most prima facie compelling such argument: an Egyptology argument (an analogue of the Egyptology argument from population ethics). …

Longtermist institutional reform – Tyler M. John (Rutgers University) and William MacAskill (Global Priorities Institute, Oxford University)

There is a vast number of people who will live in the centuries and millennia to come. Even if Homo sapiens survives merely as long as a typical species, we have hundreds of thousands of years ahead of us. And our future potential could be much greater than that again: it will be hundreds of millions of years until the Earth is sterilized by the expansion of the Sun, and many trillions of years before the last stars die out. …

It Only Takes One: The Psychology of Unilateral Decisions – Joshua Lewis (New York University) et al.

Sometimes, one decision can guarantee that a risky event will happen. For instance, it only took one team of researchers to synthesize and publish the horsepox genome, thus imposing its publication even though other researchers might have refrained for biosecurity reasons. We examine cases where everybody who can impose a given event has the same goal but different information about whether the event furthers that goal. …