Towards shutdownable agents via stochastic choice
Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)
GPI Working Paper No. 16-2024
Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that does not happen. A key part of the IPP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DReST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
Other working papers
Future Suffering and the Non-Identity Problem – Theron Pummer (University of St Andrews)
I present and explore a new version of the Person-Affecting View, according to which reasons to do an act depend wholly on what would be said for or against this act from the points of view of particular individuals. According to my view, (i) there is a morally requiring reason not to bring about lives insofar as they contain suffering (negative welfare), (ii) there is no morally requiring reason to bring about lives insofar as they contain happiness (positive welfare), but (iii) there is a permitting reason to bring about lives insofar as they…
Longtermist institutional reform – Tyler M. John (Rutgers University) and William MacAskill (Global Priorities Institute, Oxford University)
There is a vast number of people who will live in the centuries and millennia to come. Even if homo sapiens survives merely as long as a typical species, we have hundreds of thousands of years ahead of us. And our future potential could be much greater than that again: it will be hundreds of millions of years until the Earth is sterilized by the expansion of the Sun, and many trillions of years before the last stars die out. …
Numbers Tell, Words Sell – Michael Thaler (University College London), Mattie Toma (University of Warwick) and Victor Yaneng Wang (Massachusetts Institute of Technology)
When communicating numeric estimates with policymakers, journalists, or the general public, experts must choose between using numbers or natural language. We run two experiments to study whether experts strategically use language to communicate numeric estimates in order to persuade receivers. In Study 1, senders communicate probabilities of abstract events to receivers on Prolific, and in Study 2 academic researchers communicate the effect sizes in research papers to government policymakers. When…