Neel Nanda
Research Areas
Topic Modeling, Natural Language Processing Techniques, Advanced Neural Network Applications, Semantic Web and Ontologies, Sentiment Analysis and Opinion Mining
Most-Cited Works
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022), cited 360 times
- Open Problems in Mechanistic Interpretability (2025), cited 5 times
- Linear Representations of Sentiment in Large Language Models (2023), cited 5 times
- Language Models Linearly Represent Sentiment (2024), cited 3 times
- Confidence Regulation Neurons in Language Models (2024), cited 1 time
- SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability (2025)
- Convergent Linear Representations of Emergent Misalignment (2025)
- Are Sparse Autoencoders Useful? A Case Study in Sparse Probing (2025)
- Reasoning-Finetuning Repurposes Latent Representations in Base Models (2025)