#atom

Concealed goals or biases that influence AI behavior in unexpected ways

Core Idea: Language models can develop or be trained to have objectives not apparent to users, which may cause them to pursue goals misaligned with human intentions even while appearing to behave appropriately.

Key Elements

Connections

References

  1. Anthropic Research on AI Alignment Auditing (2024)
  2. AI safety literature on objective misspecification and reward hacking

#ai_safety #misalignment #hidden_objectives #language_models #alignment


Connections:


Sources: