Matthew Rahtz
Research Areas
Explainable Artificial Intelligence (XAI), Topic Modeling, Natural Language Processing Techniques, Reinforcement Learning in Robotics, Ethics and Social Impacts of AI
Most-Cited Works
- Ensembl 2016 (2015), 1,352 citations
- Evaluating Frontier Models for Dangerous Capabilities (2024), 9 citations
- A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI (2024), 7 citations
- Tracr: Compiled Transformers as a Laboratory for Interpretability (2023), 6 citations
- Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla (2023), 6 citations
- Safe Deep RL in 3D Environments using Human Feedback (2022), 2 citations
- The Hydra Effect: Emergent Self-repair in Language Model Computations (2023), 2 citations
- An Extensible Interactive Interface for Agent Design (2019), 1 citation
- Truth in the 'killer robots' angle? (2017)