HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation
Top 10% of 2023 papers
Abstract
Data preparation is crucial to achieving optimized results in machine learning (ML). However, building a good data preparation pipeline is highly non-trivial for ML practitioners, because a good pipeline is not only domain-specific but also dataset-specific. There are two common practices. Human-generated pipelines (HI-pipelines) typically draw on a wide range of operations and libraries but are highly experience- and heuristic-based. In contrast, machine-generated pipelines (AI-pipelines), a.k.a. AutoML, often adopt a predefined set of sophisticated operations and are search-based and optimized. These two practices are complementary. In this paper, we study a new problem: given an HI-pipeline and an AI-pipeline for the same ML task, can we combine them into a new pipeline (HAI-pipeline) that is better than both the provided HI-pipeline and the provided AI-pipeline? We propose HAIPipe, a framework that addresses this problem by adopting an enumeration-sampling strategy to carefully select the best-performing combined pipeline. We also introduce a reinforcement learning (RL) based approach to search for an optimized AI-pipeline. Extensive experiments using 1400+ real-world HI-pipelines (Jupyter notebooks from Kaggle) verify that HAIPipe significantly outperforms approaches that use either HI-pipelines or AI-pipelines alone.
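To make the enumeration-sampling idea concrete, the following is a minimal sketch, not the HAIPipe implementation. It assumes that a pipeline is an ordered list of operation names, that candidates are the interleavings of the two pipelines that preserve each pipeline's internal order, and that a caller-supplied `score` function (hypothetical, for example validation accuracy of a downstream model) ranks candidates. The function names, the toy operations, and the choice to always score the two input pipelines as baselines are illustrative assumptions that the abstract does not specify.

```python
import random
from itertools import combinations


def enumerate_combinations(hi_ops, ai_ops):
    """Yield every interleaving of hi_ops and ai_ops.

    Each interleaving keeps the relative order of the operations
    within hi_ops and within ai_ops (an assumption of this sketch).
    """
    n, m = len(hi_ops), len(ai_ops)
    # Choose which of the n + m positions are filled by AI operations.
    for ai_slots in combinations(range(n + m), m):
        slots = set(ai_slots)
        hi_iter, ai_iter = iter(hi_ops), iter(ai_ops)
        yield [next(ai_iter) if i in slots else next(hi_iter)
               for i in range(n + m)]


def select_best(hi_ops, ai_ops, score, sample_size, seed=0):
    """Sample candidate pipelines, score them, and return the best one."""
    candidates = list(enumerate_combinations(hi_ops, ai_ops))
    rng = random.Random(seed)
    sampled = rng.sample(candidates, min(sample_size, len(candidates)))
    # Assumption: also score the two input pipelines, so the result
    # is never worse than either input under the given score.
    sampled += [list(hi_ops), list(ai_ops)]
    return max(sampled, key=score)


def toy_score(pipeline):
    """Hypothetical stand-in for training and validating a model."""
    return 1.0 if pipeline == ["impute", "scale", "encode"] else 0.0


hi_pipeline = ["impute", "encode"]   # toy human-generated pipeline
ai_pipeline = ["scale"]              # toy machine-generated pipeline
best = select_best(hi_pipeline, ai_pipeline, toy_score, sample_size=3)
print(best)
```

In practice `score` would be expensive, since each call would involve running the pipeline and training a model, which is why sampling a subset of the enumerated candidates matters; here the toy score only stands in for that evaluation.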