A public good project by the Synthesis Company of California

© 2026

Meg Tong | doi.page





Research Areas

Topic Modeling, Natural Language Processing Techniques, Interactive and Immersive Displays, Ethics and Social Impacts of AI, Adversarial Robustness in Machine Learning

Most-Cited Works

→ Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (2024), 31 citations
→ The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" (2023), 31 citations
→ Steering Llama 2 via Contrastive Activation Addition (2024), 16 citations
→ Taken out of context: On measuring situational awareness in LLMs (2023), 8 citations
→ Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming (2025), 5 citations
→ Many-shot Jailbreaking (2024), 3 citations
→ Auditing language models for hidden objectives (2025), 1 citation
→ Forecasting Rare Language Model Behaviors (2025)