Daniel Paleka
Research Areas
Topic Modeling, Natural Language Processing Techniques, Adversarial Robustness in Machine Learning, Spam and Phishing Detection, Explainable Artificial Intelligence (XAI)
Most-Cited Works
- Poisoning Web-Scale Training Datasets is Practical (2024), 78 citations
- Red-Teaming the Stable Diffusion Safety Filter (2022), 25 citations
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models (2024), 14 citations
- ARB: Advanced Reasoning Benchmark for Large Language Models (2023), 14 citations
- Refusal in Language Models Is Mediated by a Single Direction (2024), 9 citations
- Evaluating Superhuman Models with Consistency Checks (2024), 8 citations
- Stealing Part of a Production Language Model (2024), 7 citations