Evolutionary debunking and value alignment

Michael T. Dale (Hampden-Sydney College) and Bradford Saad (Global Priorities Institute, University of Oxford)

GPI Working Paper No. 11-2024

This paper examines the bearing of evolutionary debunking arguments, which use the evolutionary origins of values to challenge their epistemic credentials, on the alignment problem, i.e. the problem of ensuring that highly capable AI systems are properly aligned with values. Since evolutionary debunking arguments are among the best empirically motivated arguments that recommend changes in values, it is unsurprising that they are relevant to the alignment problem. However, how evolutionary debunking arguments bear on alignment is a neglected issue. This paper sheds light on that issue by showing how evolutionary debunking arguments: (1) raise foundational challenges to posing the alignment problem, (2) yield normative constraints on solving it, and (3) generate stumbling blocks for implementing solutions. After mapping some general features of this philosophical terrain, we illustrate how evolutionary debunking arguments interact with some of the main technical approaches to alignment. To conclude, we motivate a parliamentary approach to alignment and suggest some ways of developing and testing it.

Other working papers

Is In-kind Kinder than Cash? The Impact of Money vs Food Aid on Social Emotions and Aid Take-up – Samantha Kassirer, Ata Jami, & Maryam Kouchaki (Northwestern University)

The academic and philanthropic communities have widely endorsed the new model of giving cash to those in need. Yet the recipient’s perspective has mostly been ignored. The present research explores how food-insecure individuals feel and respond when offered either monetary or food aid from a charity. Our results reveal that individuals are less likely to accept money than food aid from charity because receiving money feels relatively more shameful and relatively less socially positive. Since many…

‘The only ethical argument for positive 𝛿’? – Andreas Mogensen (Global Priorities Institute, Oxford University)

I consider whether a positive rate of pure intergenerational time preference is justifiable in terms of agent-relative moral reasons relating to partiality between generations, an idea I call discounting for kinship. I respond to Parfit’s objections to discounting for kinship, but then highlight a number of apparent limitations of this…

Longtermism, aggregation, and catastrophic risk – Emma J. Curran (University of Cambridge)

Advocates of longtermism point out that interventions which focus on improving the prospects of people in the very far future will, in expectation, bring about a significant amount of good. Indeed, in expectation, such long-term interventions bring about far more good than their short-term counterparts. As such, longtermists claim we have compelling moral reason to prefer long-term interventions. …