Neel Nanda | doi.page

0 works, 0 citations, 0 h-index

Research Areas

Topic Modeling, Natural Language Processing Techniques, Advanced Neural Network Applications, Semantic Web and Ontologies, Sentiment Analysis and Opinion Mining

Most-Cited Works

→ Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022), 360 cited
→ Open Problems in Mechanistic Interpretability (2025), 5 cited
→ Linear Representations of Sentiment in Large Language Models (2023), 5 cited
→ Language Models Linearly Represent Sentiment (2024), 3 cited
→ Confidence Regulation Neurons in Language Models (2024), 1 cited
→ SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability (2025)
→ Convergent Linear Representations of Emergent Misalignment (2025)
→ Are Sparse Autoencoders Useful? A Case Study in Sparse Probing (2025)
→ Reasoning-Finetuning Repurposes Latent Representations in Base Models (2025)