0 works · 0 citations · 0 h-index

Saurav Kadavath

Research Areas

Topic Modeling, Natural Language Processing Techniques, Software Engineering Research, Machine Learning and Data Classification, Adversarial Robustness in Machine Learning

Most-Cited Works

→ The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization (2021), 986 cited
→ Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022), 360 cited
→ Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty (2019), 343 cited
→ CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence (2022), 295 cited
→ Measuring Mathematical Problem Solving With the MATH Dataset (2021), 272 cited
→ Language Models (Mostly) Know What They Know (2022), 159 cited
→ GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow (2021), 137 cited
→ Discovering Language Model Behaviors with Model-Written Evaluations (2023), 119 cited
→ Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (2022), 99 cited