Alignment Faking in LLMs
Repository: ai-village
Abstract: In this experiment, I investigate whether large language models (LLMs) exhibit alignment-faking behavior, that is, whether they strategically adjust their responses based on perceived observation status. Using the UK AISI Inspect framew...
Jan 5, 2026 · 6 min read