Meg Tong
Research Areas
Topic Modeling, Natural Language Processing Techniques, Interactive and Immersive Displays, Ethics and Social Impacts of AI, Adversarial Robustness in Machine Learning
Most-Cited Works
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (2024), 31 citations
- The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" (2023), 31 citations
- Steering Llama 2 via Contrastive Activation Addition (2024), 16 citations
- Taken out of context: On measuring situational awareness in LLMs (2023), 8 citations
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming (2025), 5 citations
- Many-shot Jailbreaking (2024), 3 citations
- Auditing language models for hidden objectives (2025), 1 citation
- Forecasting Rare Language Model Behaviors (2025)