What power-seeking theorems do not show
David Thorstad (Vanderbilt University)
GPI Working Paper No. 27-2024
Recent years have seen increasing concern that artificial intelligence may soon pose an existential risk to humanity. One leading ground for concern is that artificial agents may be power-seeking, aiming to acquire power and in the process disempowering humanity. A range of power-seeking theorems seek to give formal articulation to the idea that artificial agents are likely to be power-seeking. I argue that leading theorems face five challenges, then draw lessons from this result.
Other working papers
Against the singularity hypothesis – David Thorstad (Global Priorities Institute, University of Oxford)
The singularity hypothesis is a radical hypothesis about the future of artificial intelligence on which self-improving artificial agents will quickly become orders of magnitude more intelligent than the average human. Despite the ambitiousness of its claims, the singularity hypothesis has been defended at length by leading philosophers and artificial intelligence researchers. In this paper, I argue that the singularity hypothesis rests on scientifically implausible growth assumptions. …
Towards shutdownable agents via stochastic choice – Elliott Thornley (Global Priorities Institute, University of Oxford), Alexander Roman (New College of Florida), Christos Ziakas (Independent), Leyton Ho (Brown University), and Louis Thomson (University of Oxford)
Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that does not happen. A key part of the IPP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose…
Dispelling the Anthropic Shadow – Teruji Thomas (Global Priorities Institute, University of Oxford)
There are some possible events that we could not possibly discover in our past. We could not discover an omnicidal catastrophe, an event so destructive that it permanently wiped out life on Earth. Had such a catastrophe occurred, we wouldn’t be here to find out. This space of unobservable histories has been called the anthropic shadow. Several authors claim that the anthropic shadow leads to an ‘observation selection bias’, analogous to survivorship bias, when we use the historical record to estimate catastrophic risks. …