Skrub
Skrub is an open-source Python library for data preprocessing in machine learning pipelines that operate on dataframes. It builds on dataframe libraries such as pandas and polars rather than replacing them, adding high-level tools for data exploration, cleaning, and feature engineering. Its components include TableReport, which generates data exploration reports; Cleaner, which sanitizes a dataframe; and TableVectorizer, which handles feature engineering. For complex multi-table scenarios, skrub provides Data Ops (DataOps), which let users build and validate pipelines that span multiple dataframes, including hyperparameter tuning. The library targets data scientists and machine learning practitioners who work with Python dataframes and need the preprocessing building blocks common in ML workflows. Transformations can be customized through parameters and column selectors, so users can tailor them to their datasets. Skrub is free, can be installed via pip, and fits into existing pandas or polars workflows.
Skrub is a free, open-source Python library that enhances dataframe-based machine learning preprocessing with tools for exploration, cleaning, feature engineering, and multi-table pipeline validation.
Data Exploration
A data scientist needs to quickly generate a report summarizing the characteristics of a new dataset.
Multi-Table Pipeline Validation
A machine learning practitioner works with multiple related dataframes and requires a validated preprocessing pipeline with hyperparameter tuning.
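Run pip install skrub to install the library.
Import the components you need: from skrub import TableReport, Cleaner
Create a report with report = TableReport(df), where df is your dataframe, then call report.open() to view it in a browser or report.write_html("report.html") to save it to a file.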