New Galaxy Tool: Identify Label Issues in Machine Learning Datasets Using Cleanlab
Quality control ML datasets
Machine learning (ML) models are only as good as the data they learn from, and mislabeled data can severely degrade model performance. That’s where Cleanlab comes in. We’re excited to announce that Cleanlab, a powerful open-source library for detecting label errors in datasets, has been integrated into Galaxy by Mohammad Joudy! This integration allows users to run a Cleanlab-based detection tool directly within the Galaxy platform.
What Can You Do With It?
- Automatically identify mislabeled data using Cleanlab functionality.
- Improve ML model accuracy by filtering or correcting noisy labels for both classification and regression tasks.
The image above shows the difference in the classification performance of a few classifiers across a few PMLB ML benchmark datasets. Improvements in classification performance of up to 20% are achieved on these datasets.
How It Works
- Upload your dataset (with features and labels).
- Run the Cleanlab Galaxy tool to:
  - Get a report that identifies potential label errors.
  - Clean your original raw dataset by removing the erroneous or low-quality samples, giving you a better-quality dataset.
The tool makes it easy to clean up datasets in a reproducible way before training your models — a crucial step in any robust ML pipeline.
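To give a feel for what "identify potential label errors, then remove them" means in practice, here is a minimal, dependency-free Python sketch. It is not the Galaxy tool's implementation and it does not call Cleanlab itself; it only illustrates a simplified version of the per-class confidence threshold idea associated with confident learning. The function names (`class_thresholds`, `find_suspect_labels`, `drop_rows`) and the toy data are hypothetical, and the sketch assumes you already have predicted class probabilities for each sample, ideally from out-of-sample (cross-validated) predictions.

```python
from typing import List, Sequence


def class_thresholds(labels: Sequence[int],
                     pred_probs: Sequence[Sequence[float]],
                     n_classes: int) -> List[float]:
    """Per-class threshold: mean predicted probability of class k
    over the samples whose given label is k."""
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # A class with no samples gets threshold 1.0 (an arbitrary choice
        # for this sketch) so it is almost never considered "confident".
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)
    return thresholds


def find_suspect_labels(labels: Sequence[int],
                        pred_probs: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of samples whose most confident class (among classes
    that reach their threshold) disagrees with the given label."""
    n_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, n_classes)
    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                suspects.append(i)
    return suspects


def drop_rows(rows: Sequence, suspects: Sequence[int]) -> list:
    """Return a copy of rows without the flagged indices."""
    flagged = set(suspects)
    return [r for i, r in enumerate(rows) if i not in flagged]


# Toy data (made up for illustration): sample 2 is labeled 0, but the
# model is confident that it belongs to class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]

suspects = find_suspect_labels(labels, pred_probs)
cleaned_labels = drop_rows(labels, suspects)
print("Suspect indices:", suspects)
print("Cleaned labels:", cleaned_labels)
```

In this toy run only sample 2 is flagged, and dropping it leaves five samples. The Galaxy tool handles these steps for you and also supports regression, which this sketch does not cover.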