New Galaxy Tool: Identify Label Issues in Machine Learning Datasets Using Cleanlab
Quality control ML datasets
Machine learning (ML) models are only as good as the data they learn from, and mislabeled data can severely impact model performance. That’s where Cleanlab comes in. We’re excited to announce that Mohammad Joudy has integrated Cleanlab, a powerful open-source library for detecting label errors in datasets, into Galaxy! This integration allows users to run a Cleanlab-based label-error detection tool directly within the Galaxy platform.
What Can You Do With It?
- Automatically identify mislabeled data using Cleanlab functionality.
- Improve ML model accuracy by filtering or correcting noisy labels in both classification and regression tasks.

The image above shows the difference in classification performance of several classifiers across several PMLB ML benchmark datasets. Improvements in classification performance of up to 20% are achieved on these datasets.
How It Works
- Upload your dataset (with features and labels).
- Run the Cleanlab Galaxy tool to:
  - Get a report that identifies potential label errors.
  - Clean your original raw dataset by removing the erroneous or low-quality samples, yielding a better-quality dataset.
The tool makes it easy to clean up datasets in a reproducible way before training your models — a crucial step in any robust ML pipeline.
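To make the two steps above (report, then clean) more concrete, here is a minimal Python sketch of the general idea behind this kind of label-error detection. It is not the Galaxy tool's code and not Cleanlab's actual algorithm: the function `flag_label_issues` is a hypothetical helper, the data are made up, and the rule is a simplified version of the confident-learning approach, in which a sample is flagged when the model is confident in some class other than the one the sample was labeled with. The predicted probabilities are assumed to come from a classifier evaluated out-of-sample (for example via cross-validation).

```python
from collections import defaultdict


def flag_label_issues(labels, pred_probs):
    """Return indices of likely mislabeled samples, most suspicious first.

    Simplified rule (not Cleanlab's full algorithm): a sample is flagged when
    the model is confident in some class, but not in the class it was given.
    """
    # Per-class threshold: the average predicted probability of class k,
    # taken over the samples that are labeled k.
    sums, counts = defaultdict(float), defaultdict(int)
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    thresholds = {k: sums[k] / counts[k] for k in counts}

    flagged = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability reaches that class's threshold.
        confident = [k for k, t in thresholds.items() if probs[k] >= t]
        if confident and y not in confident:
            flagged.append(i)

    # Rank by the probability assigned to the given label (lowest first).
    return sorted(flagged, key=lambda i: pred_probs[i][labels[i]])


# Toy dataset (made up): two features per sample and a binary label.
features = [[1.0, 2.0], [1.1, 1.9], [5.0, 6.1], [5.2, 6.0], [4.9, 5.8], [0.9, 2.1]]
labels = [0, 0, 0, 1, 1, 1]
# Assumed out-of-sample predicted probabilities, one row per sample.
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.10, 0.90],
    [0.85, 0.15],
]

# Step 1: the "report" of suspected label errors.
issues = flag_label_issues(labels, pred_probs)
print("Suspected label issues at rows:", issues)

# Step 2: the cleaned dataset, with the suspected rows removed.
issue_set = set(issues)
cleaned = [
    (x, y)
    for i, (x, y) in enumerate(zip(features, labels))
    if i not in issue_set
]
print("Rows kept after cleaning:", len(cleaned))
```

On this toy input the sketch flags rows 2 and 5, whose given labels disagree with confident model predictions, and keeps the remaining four rows. Removing flagged rows is only one option; in practice it is worth inspecting the report before discarding or relabeling data.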