Towards shutdownable agents via stochastic choice
Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Imperial College, London), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)
GPI Working Paper No. 16-2024
The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
Other working papers
Cassandra’s Curse: A second tragedy of the commons – Philippe Colo (ETH Zurich)
This paper studies why scientific forecasts regarding exceptional or rare events generally fail to trigger adequate public response. I consider a game of contribution to a public bad. Prior to the game, I assume contributors receive non-verifiable expert advice regarding uncertain damages. In addition, I assume that the expert cares only about social welfare. Under mild assumptions, I show that no information transmission can happen at equilibrium when the number of contributors…
Consequentialism, Cluelessness, Clumsiness, and Counterfactuals – Alan Hájek (Australian National University)
According to a standard statement of objective consequentialism, a morally right action is one that has the best consequences. More generally, given a choice between two actions, one is morally better than the other just in case the consequences of the former action are better than those of the latter. (These are not just the immediate consequences of the actions, but the long-term consequences, perhaps until the end of history.) This account glides easily off the tongue—so easily that…
Choosing the future: Markets, ethics and rapprochement in social discounting – Antony Millner (University of California, Santa Barbara) and Geoffrey Heal (Columbia University)
This paper provides a critical review of the literature on choosing social discount rates (SDRs) for public cost-benefit analysis. We discuss two dominant approaches, the first based on market prices, and the second based on intertemporal ethics. While both methods have attractive features, neither is immune to criticism. …