Saurav Kadavath
Research Areas
Topic Modeling, Natural Language Processing Techniques, Software Engineering Research, Machine Learning and Data Classification, Adversarial Robustness in Machine Learning
Most-Cited Works
- The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization (2021), 986 citations
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022), 360 citations
- Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty (2019), 343 citations
- CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence (2022), 295 citations
- Measuring Mathematical Problem Solving With the MATH Dataset (2021), 272 citations
- Language Models (Mostly) Know What They Know (2022), 159 citations
- GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow (2021), 137 citations
- Discovering Language Model Behaviors with Model-Written Evaluations (2023), 119 citations
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (2022), 99 citations