Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure
Abstract
Datasets that power machine learning are often used, shared, and reused with little visibility into the processes of deliberation that led to their creation. As artificial intelligence systems are increasingly used in high-stakes tasks, system development and deployment practices must be adapted to address the very real consequences of how model development data is constructed and used in practice. This includes greater transparency about data, and accountability for decisions made when developing it. In this paper, we introduce a rigorous framework for dataset development transparency that supports decision-making and accountability. The framework draws on best practices from the software development lifecycle, reflecting the cyclical, infrastructural, and engineering nature of dataset development. Each stage of the data development lifecycle yields documents that facilitate improved communication and decision-making, while also drawing attention to the value and necessity of careful data work. The proposed framework makes visible the often overlooked work and decisions that go into dataset creation, a critical step in closing the accountability gap in artificial intelligence and a necessary complement to recent work on auditing processes.