Buck Shlegeris | doi.page

Buck Shlegeris

Research Areas

Adversarial Robustness in Machine Learning, Ethics and Social Impacts of AI, Topic Modeling, Natural Language Processing Techniques, Human-Automation Interaction and Safety

Most-Cited Works

→ Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (2024), 31 citations
→ Alignment faking in large language models (2024), 17 citations
→ Adversarial Training for High-Stakes Reliability (2022), 9 citations
→ Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models (2024), 4 citations
→ Polysemanticity and Capacity in Neural Networks (2022), 4 citations
→ AI Control: Improving Safety Despite Intentional Subversion (2023), 3 citations
→ Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (2025), 2 citations
→ Sabotage Evaluations for Frontier Models (2024), 2 citations
→ Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats (2024), 1 citation
→ Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols (2024), 1 citation