Guilherme Penedo
Research Areas
Natural Language Processing Techniques; Topic Modeling; Hate Speech and Cyberbullying Detection; Mathematics, Computing, and Information Processing; Speech Recognition and Synthesis
Most-Cited Works
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (2023), cited 156 times
- The Falcon Series of Open Language Models (2023), cited 112 times
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale (2024), cited 19 times
- AlGhafa Evaluation Benchmark for Arabic Language Models (2023), cited 8 times
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model (2025), cited 6 times
- Towards Best Practices for Open Datasets for LLM Training (2025), cited 2 times
- Artery in Microgravity (AIM): Assembly, integration, and testing for a student payload for the ISS (2022)
- DataTrove: large scale data processing (2026)