Towards shutdownable agents via stochastic choice
Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)
GPI Working Paper No. 16-2024
Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that does not happen. A key part of the IPP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DReST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
Other working papers
Longtermism in an Infinite World – Christian J. Tarsney (Population Wellbeing Initiative, University of Texas at Austin) and Hayden Wilkinson (Global Priorities Institute, University of Oxford)
The case for longtermism depends on the vast potential scale of the future. But that same vastness also threatens to undermine the case for longtermism: If the future contains infinite value, then many theories of value that support longtermism (e.g., risk-neutral total utilitarianism) seem to imply that no available action is better than any other. And some strategies for avoiding this conclusion (e.g., exponential time discounting) yield views that…
Aggregating Small Risks of Serious Harms – Tomi Francis (Global Priorities Institute, University of Oxford)
According to Partial Aggregation, a serious harm can be outweighed by a large number of somewhat less serious harms, but can outweigh any number of trivial harms. In this paper, I address the question of how we should extend Partial Aggregation to cases of risk, and especially to cases involving small risks of serious harms. I argue that, contrary to the most popular versions of the ex ante and ex post views, we should sometimes prevent a small risk that a large number of people will suffer serious harms rather than prevent…
Intergenerational equity under catastrophic climate change – Aurélie Méjean (CNRS, Paris), Antonin Pottier (EHESS, CIRED, Paris), Stéphane Zuber (CNRS, Paris) and Marc Fleurbaey (CNRS, Paris School of Economics)
Climate change raises the issue of intergenerational equity. As climate change threatens irreversible and dangerous impacts, possibly leading to extinction, the most relevant trade-off may not be between present and future consumption, but between present consumption and the mere existence of future generations. To investigate this trade-off, we build an integrated assessment model that explicitly accounts for the risk of extinction of future generations…