Buck Shlegeris
Research Areas
Adversarial Robustness in Machine Learning, Ethics and Social Impacts of AI, Topic Modeling, Natural Language Processing Techniques, Human-Automation Interaction and Safety
Most-Cited Works
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (2024), 31 citations
- Alignment faking in large language models (2024), 17 citations
- Adversarial Training for High-Stakes Reliability (2022), 9 citations
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models (2024), 4 citations
- Polysemanticity and Capacity in Neural Networks (2022), 4 citations
- AI Control: Improving Safety Despite Intentional Subversion (2023), 3 citations
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (2025), 2 citations
- Sabotage Evaluations for Frontier Models (2024), 2 citations
- Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats (2024), 1 citation
- Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols (2024), 1 citation