The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

Elliott Thornley (Global Priorities Institute, University of Oxford)

GPI Working Paper No. 10-2024, forthcoming in Philosophical Studies

I explain and motivate the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. I prove three theorems that make the difficulty precise. These theorems suggest that agents satisfying some innocuous-seeming conditions will often try to prevent or cause the pressing of the shutdown button, even in cases where it’s costly to do so. I end by noting that these theorems can guide our search for solutions to the problem.

Other working papers

Longtermism, aggregation, and catastrophic risk – Emma J. Curran (University of Cambridge)

Advocates of longtermism point out that interventions which focus on improving the prospects of people in the very far future will, in expectation, bring about a significant amount of good. Indeed, in expectation, such long-term interventions bring about far more good than their short-term counterparts. As such, longtermists claim we have compelling moral reason to prefer long-term interventions. …

A non-identity dilemma for person-affecting views – Elliott Thornley (Global Priorities Institute, University of Oxford)

Person-affecting views in population ethics state that (in cases where all else is equal) we’re permitted but not required to create people who would enjoy good lives. In this paper, I present an argument against every possible variety of person- affecting view. The argument takes the form of a dilemma. Narrow person-affecting views must embrace at least one of three implausible verdicts in a case that I call ‘Expanded Non- Identity.’ Wide person-affecting views run into trouble in a case that I call ‘Two-Shot Non-Identity.’ …

Dynamic public good provision under time preference heterogeneity – Philip Trammell (Global Priorities Institute and Department of Economics, University of Oxford)

I explore the implications of time preference heterogeneity for the private funding of public goods. The assumption that players use a common discount rate is knife-edge: relaxing it yields substantially different equilibria, for two reasons. First, time preference heterogeneity motivates intertemporal polarization, analogous to the polarization seen in a static public good game. In the simplest settings, more patient players spend nothing early in time and less patient players spending nothing later. Second…