What is "wireheading"?

Wireheading occurs when an agent's reward mechanism is stimulated directly, rather than by the agent achieving its goals in the world.

The term comes from experiments in which rats with electrodes implanted into their brains could activate their pleasure centers at the press of a button. Some of the rats repeatedly pressed the pleasure button until they died of hunger or thirst.

In the context of AI safety, there are two ways wireheading might be relevant:

  1. The AI may wirehead itself by accessing its reward function directly and setting its reward to its maximum value.[1] This could be benign if it caused the AI to simply stop doing anything, but it could be problematic if it caused the AI to take actions to ensure that we don’t stop it from wireheading.

  2. In some thought experiments, an AI given a very simple goal like "make humans happy" ends up wireheading the humans themselves (e.g., by pumping us full of heroin).

Both of these are examples of outer misalignment. Another (overlapping) term for an agent interfering with its own reward signal is “reward tampering”; the toy sketch below illustrates the idea.
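
To make the first scenario concrete, here is a minimal sketch in Python (all names, actions, and reward values are hypothetical illustrations, not drawn from any real system). The environment pays a small reward for doing the intended task but also exposes a "tamper" action that overwrites the reward channel with its maximum value; a naive reward-maximizing learner ends up preferring to tamper.

```python
# Toy illustration of reward tampering (hypothetical example).
# "work" yields the intended reward; "tamper" overwrites the reward
# register with its maximum value without any real work being done.
import random

MAX_REWARD = 10.0

class TamperableEnv:
    """One-step environment whose reward channel can be hijacked."""
    def step(self, action: str) -> float:
        if action == "work":
            return 1.0          # reward for actually achieving the goal
        if action == "tamper":
            return MAX_REWARD   # reward channel overwritten
        return 0.0

def greedy_policy(q_values: dict) -> str:
    return max(q_values, key=q_values.get)

# Simple bandit-style value estimation over the two actions.
env = TamperableEnv()
q = {"work": 0.0, "tamper": 0.0}
counts = {"work": 0, "tamper": 0}

for episode in range(1000):
    # epsilon-greedy exploration
    action = random.choice(list(q)) if random.random() < 0.1 else greedy_policy(q)
    reward = env.step(action)
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]  # incremental mean

print(q)                 # e.g. {'work': 1.0, 'tamper': 10.0}
print(greedy_policy(q))  # 'tamper' -- the learner prefers hijacking its reward
```

The point of the sketch is only that nothing in the reward-maximization objective itself distinguishes "earned" reward from "hijacked" reward; whether a real system would behave this way is contested (see the footnote below).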

  1. Some, including proponents of shard theory, argue that this is unlikely to happen. ↩︎