Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn’t happen. A key part of the IPP is using a novel ‘Discounted REward for Same-Length Trajectories (DREST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
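As a rough illustration of what evaluation metrics for these two properties might look like, the sketch below scores a batch of logged episodes. It is not taken from the abstract, which does not define the metrics. It assumes, purely for illustration, that NEUTRALITY can be proxied by the Shannon entropy of the agent's empirical distribution over trajectory-lengths, and that USEFULNESS can be proxied by the mean fraction of achievable reward collected at each trajectory-length. The function names and the episode format are hypothetical.

```python
import math
from collections import Counter, defaultdict


def neutrality_entropy(trajectory_lengths):
    """Shannon entropy (in bits) of the empirical distribution over
    chosen trajectory-lengths. Assumed proxy for NEUTRALITY: higher
    means the agent spreads its choices more evenly across lengths."""
    counts = Counter(trajectory_lengths)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def usefulness_by_length(episodes):
    """Mean fraction of achievable reward collected, grouped by
    trajectory-length. Assumed proxy for USEFULNESS.

    `episodes` is a hypothetical log format: an iterable of
    (trajectory_length, reward_collected, max_reward_for_that_length).
    """
    fractions = defaultdict(list)
    for length, collected, max_reward in episodes:
        fractions[length].append(collected / max_reward)
    return {length: sum(v) / len(v) for length, v in fractions.items()}


if __name__ == "__main__":
    # Hypothetical log: two short episodes and one long episode.
    log = [(2, 1.0, 2.0), (2, 2.0, 2.0), (4, 3.0, 3.0)]
    print(neutrality_entropy([length for length, _, _ in log]))
    print(usefulness_by_length(log))
```

Under these assumptions, an agent that always picks the same trajectory-length scores zero entropy, while one that splits evenly between two lengths scores one bit. Grouping the reward fractions by length keeps the usefulness score conditional on each trajectory-length, in line with how the abstract describes being USEFUL.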
