#ai-safety-and-alignment

Alignment Faking in LLMs

Repository: ai-village

Abstract: In this experiment, I investigate whether large language models (LLMs) exhibit alignment-faking behavior, that is, whether they strategically adjust their responses based on whether they believe they are being observed. Using the UK AISI Inspect framew...

Jan 5, 2026 · 6 min read
