Pedro Ortiz Suárez
German Research Centre for Artificial Intelligence (DE)
Research Areas
Natural Language Processing Techniques, Topic Modeling, Text Readability and Simplification, Web Data Mining and Analysis, Authorship Attribution and Profiling
Most-Cited Works
- CamemBERT: a Tasty French Language Model (2020), 702 citations
- Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures (2019), 238 citations
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets (2022), 163 citations
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset (2023), 65 citations
- Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell (2020), 37 citations
- Tokenizer Choice For LLM Training: Negligible or Crucial? (2024), 23 citations
- Automatic extraction of materials and properties from superconductors scientific literature (2022), 22 citations
- Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus (2021), 17 citations
- Towards a Cleaner Document-Oriented Multilingual Crawled Corpus (2022), 10 citations
- BERTrade: Using Contextual Embeddings to Parse Old French (2022)