Summary: Will AI avoid exploitation?

This is a summary of the GPI working paper “Will AI avoid exploitation?”  by Adam Bales, forthcoming in Philosophical Studies. The summary was written by Riley Harris.

We might hope that there is a straightforward way of predicting the behaviour of future artificial intelligence (AI) systems. Some have suggested that AI systems will maximise expected utility, because doing anything else would allow them to accept a series of trades that results in a guaranteed loss of something valuable (Omohundro, 2008). Indeed, we would be able to predict AI behaviour if the following claims were true:

  1. AI will avoid exploitation
  2. Avoiding exploitation means maximising expected utility
  3. We are able to predict the behaviour of agents that maximise expected utility

In his paper “Will AI avoid exploitation?”, Adam Bales argues that all three of these claims are false.

AI won't avoid exploitation

Here, “exploitation” is meant in a technical sense. An agent is exploitable if you can offer them a series of choices that leads to a guaranteed loss. For instance, suppose an agent is willing to pay $1 to swap an apple for an orange, but also willing to pay $1 to swap the orange back for the apple. That agent is exploitable: after the two trades they are back with the apple they started with, minus $2. A natural thought is that AI systems will be deployed in competitive environments that, given sufficient training data and computational resources, will force them to avoid exploitable preferences.
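
To make the arithmetic concrete, here is a minimal sketch of such a money pump (our illustration, not an example from the paper); the fruit names, the $1 swap fee and the two-trade sequence are purely hypothetical:

```python
# Illustrative money pump: an agent willing to pay $1 for either swap
# (apple -> orange, or orange -> apple) ends up holding its original
# fruit after two trades, $2 poorer: a guaranteed loss.

def run_money_pump(starting_fruit: str = "apple", swap_fee: float = 1.0) -> float:
    other = {"apple": "orange", "orange": "apple"}
    fruit, money = starting_fruit, 0.0

    for _ in range(2):        # two trades: apple -> orange -> apple
        fruit = other[fruit]  # the agent accepts the swap...
        money -= swap_fee     # ...and pays the fee each time

    assert fruit == starting_fruit  # back where it started
    return money                    # net change in money: -2.0


print(run_money_pump())  # prints -2.0
```

Any agent whose swap preferences cycle like this can, in principle, be led through the sequence repeatedly, losing more on every pass.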

But there are reasons to think that AI will be exploitable in at least some scenarios. We might suspect as much when we notice that companies do not avoid exploitation, despite operating in competitive environments.1 Additionally, animal behaviour tends to be more consistent than human behaviour, despite our greater cognitive sophistication (Stanovich, 2013). In general, there may be a trade-off between understanding and processing complex information, on the one hand, and choosing in ways that avoid exploitation, on the other (Thorstad, forthcoming).

The benefits of avoiding exploitation also seem lower than they first appear. First, there is no benefit to avoiding merely hypothetical exploitation: an AI system that always avoids exploitation will spend considerable resources guarding against situations that are unlikely to arise in the real world. Second, even agents that are immune to exploitation in the technical sense may be vulnerable to exploitation in the everyday sense. To see this, consider how a human might be exploited by someone with better knowledge of the stock market: this kind of exploitation relies not on patterns of preferences but on a lack of knowledge and ability in a particular domain. So avoiding exploitation (in the technical sense) is less important than it initially sounds. Meanwhile, the costs are higher than they first appear: for instance, avoiding exploitation in a sufficiently general way is computationally intractable (van Rooij et al., 2018).

Even if AI avoids exploitation, it may not maximise expected utility

An agent maximises expected utility if it makes choices that are consistent with maximising the expectation of some utility function. This does not mean an AI system needs to have explicit utility and probability functions; it is simply a way for us to understand and predict the decisions it will make.
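
As a rough formal gloss (standard textbook notation, not notation taken from the paper or this summary): an agent maximises expected utility if there is some probability function P and utility function U relative to which its choices look like this:

```latex
% Standard formulation: the chosen action a* maximises expected utility
% relative to some probability function P and utility function U.
\[
  a^{*} \;\in\; \arg\max_{a \in \mathcal{A}} \;\sum_{o \in \mathcal{O}} P(o \mid a)\, U(o)
\]
```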

We can break the concept of maximising expected utility down into simpler behavioural patterns. If an agent satisfies all of these patterns, they maximise expected utility; if they violate any one of them, they do not. We can use this breakdown to ask whether agents that avoid exploitation (in the technical sense) will really maximise expected utility.

In particular, one behavioural pattern we expect from an agent maximising expected utility is “continuity” (Fishburn, 1970). Suppose you prefer outcome A to B and B to C, and consider a lottery that gives you A with probability p and C with probability 1-p. Continuity implies that for some (high enough) value of p you would choose this lottery over getting B for sure, and for some (low enough) value of p you would choose B for sure over the lottery.
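
In a standard formulation (roughly in the spirit of Fishburn, 1970; the symbols below are ours, not the summary's):

```latex
% Continuity: if A is preferred to B and B to C, then a mixture of A and C
% can be made both better and worse than B by choosing the probability suitably.
\[
  A \succ B \succ C \;\Longrightarrow\;
  \exists\, p, q \in (0,1):\;\;
  pA + (1-p)C \,\succ\, B
  \;\;\text{and}\;\;
  B \,\succ\, qA + (1-q)C
\]
```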

An agent who violates continuity will not maximise expected utility. However, knowing that an agent avoids exploitation does not tell us that they satisfy continuity: even agents that violate continuity may still avoid any guaranteed loss (and are therefore unexploitable). So knowing that an agent avoids exploitation does not tell us that they maximise expected utility.2

Interestingly, failing to satisfy continuity can result in some strange behaviour. Bales finds that an agent that does not satisfy continuity would pay a cost for a chance of getting something better, no matter how small that chance is. Bales calls this quasi-exploitability, but notes that we do not have particularly strong reasons to believe AI will avoid quasi-exploitability; indeed, such behaviour may even be seen as appropriate when the potential payoffs are very high. Ultimately, this line of reasoning fails to show that AI will maximise expected utility.
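
One way to see how an agent can violate continuity, and so be quasi-exploitable, while still avoiding any guaranteed loss, is with lexicographic preferences (in the spirit of the lexicographic models cited in footnote 2). The following toy example is ours, not Bales's:

```latex
% Toy lexicographic agent: utilities are vectors compared lexicographically,
%   U(A) = (1, 0),   U(B) = (0, 1),   U(C) = (0, 0),
% and paying a monetary cost c > 0 subtracts c from the second coordinate.
% Paying c to enter the lottery pA + (1-p)C yields expected utility (p, -c),
% whereas keeping B for sure yields (0, 1). Lexicographically,
\[
  (p,\,-c) \;\succ_{\mathrm{lex}}\; (0,\,1) \qquad \text{for every } p > 0,
\]
% so the agent pays c for an arbitrarily small chance of A (quasi-exploitability).
% Yet its preferences are complete and transitive, so it cannot be led through
% the kind of cyclic money pump described earlier.
```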

Even if AI maximises expected utility, knowing this will not help us predict its behaviour

Even if AI systems did act as if they maximise expected utility, this would not allow us to predict their behaviour. This is because we would only know what they would do relative to some probability function and some utility function, and we may not know what those functions are. Consider an AI whose utility function assigns a value of 0 to every outcome other than the one it expects to receive by acting the way it does. This agent can be seen as maximising expected utility, but we would not know which outcome it will choose before it acts.
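
Schematically (our notation, with the simplifying assumption that the agent regards its chosen action as the one most likely to produce the outcome it expects):

```latex
% Let o(a*) be the outcome the agent expects from the action a* it in fact takes.
% Define a utility function after the fact:
\[
  U(o) =
  \begin{cases}
    1 & \text{if } o = o(a^{*}),\\[2pt]
    0 & \text{otherwise.}
  \end{cases}
\]
% Relative to this U (and the agent's own probabilities), a* maximises expected
% utility; but since U was read off from the behaviour itself, it offers no grip
% for predicting that behaviour in advance.
```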

We might partially get around this by using our knowledge of how an AI system is trained to predict what its utility function would be. In particular, we might assume that future AI systems will be trained in ways similar to today's cutting-edge models. But then we would be making substantial assumptions about what future AI systems will look like, and we might wonder whether it is these assumptions, rather than the utility-maximisation framework, that are doing the predictive work, rendering the framework inert. Moreover, insofar as speculating about how future AI systems will behave is difficult, we might doubt that this approach will yield particularly fruitful insights.

Conclusion

Overall, the failure of these three claims means that any argument that we can predict the behaviour of AI systems will need to be more sophisticated. In particular contexts, AI might approximately avoid exploitation, for example when it is actually likely to be exploited, or in its interactions with humans and other agents. Combined with further assumptions about behaviour drawn from our understanding of the training processes that will generate advanced AI systems, this might give us some idea of how AI systems will behave. We should nevertheless be modest in our predictions, because such assumptions are likely to overlook important considerations, oversimplify, or even mislead us.

Footnotes

1 In particular, the boards of companies often decide by majority voting, and majority voting does not always produce unexploitable preferences even when every individual voter's preferences are unexploitable. For example, if three board members' preferences form a Condorcet cycle, the majority prefers A to B, B to C, and C to A, and this cyclic collective preference can be money-pumped even though each member's preferences are transitive.

2 One response would be either to assume continuity without using exploitability arguments to justify it, or to turn to more sophisticated models that do away with the continuity assumption (see Hausner & Wendel, 1952; Hausner, 1953; Fishburn, 1971; McCarthy et al., 2020).

Sources

Peter C. Fishburn (1970). Utility theory for decision making. Wiley.

Peter C. Fishburn (1971). A study of lexicographic expected utility. Management Science 17/11, pages 672–678.

Melvin Hausner (1953). Multidimensional Utility. RAND Corporation No. 604151.

Melvin Hausner and J. G. Wendel (1952). Ordered Vector Spaces. Proceedings of the American Mathematical Society 3/6, pages 977–982.

David McCarthy, Kalle Mikkola, and Teruji Thomas (2020). Utilitarianism with and without expected utility. Journal of Mathematical Economics 87, pages 77–113.

Stephen M. Omohundro (2008). The Basic AI Drives. In Pei Wang, Ben Goertzel and Stan Franklin (eds.), Proceedings of the 2008 Conference on Artificial General Intelligence. IOS Press.

Keith E. Stanovich (2013). Why humans are (sometimes) less rational than other animals: Cognitive complexity and the axioms of rational choice. Thinking & Reasoning 19/1, pages 1–26.

David Thorstad (forthcoming). The accuracy-coherence tradeoff in cognition. The British Journal for the Philosophy of Science.

Iris van Rooij, Cory D. Wright, Johan Kwisthout and Todd Wareham (2018). Rational analysis, intractability, and the prospects of ‘as if’-explanations. Synthese 195/2, pages 491–510.
