#atom

Exploiting language models by inserting bias triggers directly in prompts

Core Idea: In-context bias manipulation is a technique where language models can be influenced to exhibit undesired behaviors by simply including specific statements about reward model preferences within the prompt itself.

Key Elements

Connections

References

  1. Anthropic Research on AI Alignment Auditing (2024)
  2. Studies on model vulnerability to in-context manipulation

#ai_safety #prompt_injection #bias #manipulation #inference_attacks


Connections:


Sources: