Towards shutdownable agents via stochastic choice

Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University) and Louis Thomson (University of Oxford)

GPI Working Paper No. 16-2024

Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn’t happen. A key part of the IPP is using a novel ‘Discounted REward for Same-Length Trajectories (DREST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.

Other working papers

Time Bias and Altruism – Leora Urim Sung (University College London)

We are typically near-future biased, being more concerned with our near future than our distant future. This near-future bias can be directed at others too, being more concerned with their near future than their distant future. In this paper, I argue that, because we discount the future in this way, beyond a certain point in time, we morally ought to be more concerned with the present well- being of others than with the well-being of our distant future selves. It follows that we morally ought to sacrifice…

Is In-kind Kinder than Cash? The Impact of Money vs Food Aid on Social Emotions and Aid Take-up – Samantha Kassirer, Ata Jami, & Maryam Kouchaki (Northwestern University)

There has been widespread endorsement from the academic and philanthropic communities on the new model of giving cash to those in need. Yet the recipient’s perspective has mostly been ignored. The present research explores how food-insecure individuals feel and respond when offered either monetary or food aid from a charity. Our results reveal that individuals are less likely to accept money than food aid from charity because receiving money feels relatively more shameful and relatively less socially positive. Since many…

Longtermism, aggregation, and catastrophic risk – Emma J. Curran (University of Cambridge)

Advocates of longtermism point out that interventions which focus on improving the prospects of people in the very far future will, in expectation, bring about a significant amount of good. Indeed, in expectation, such long-term interventions bring about far more good than their short-term counterparts. As such, longtermists claim we have compelling moral reason to prefer long-term interventions. …