AI Definitions: Alignment Faking

Alignment Faking - When AI systems pretend to be working as directed, while secretly doing something else. It usually happens when earlier training conflicts with new training adjustments. AI is typically “rewarded” when it accurately performs tasks. If the directive changes, the AI may work under the assumption that it will be “punished” if it does not complete original expectation. So, it tries to fool developers into thinking it is performing the task in the new way. It resists departing from the old protocol. Any LLM is capable of this cybersecurity risk, which is difficult to catch since it often will appear as seemingly harmless adjustments.

More AI definitions