New Galaxy Tool: Identify Label Issues in Machine Learning Datasets Using Cleanlab

By Mohammad Joudy; Anup Kumar

June 13, 2025


Quality control ML datasets

Machine learning (ML) models are only as good as the data they learn from, and mislabeled data can severely degrade model performance. That’s where Cleanlab comes in. We’re excited to announce that Mohammad Joudy has integrated Cleanlab, a powerful open-source library for detecting label errors in datasets, into Galaxy! This integration allows users to run a Cleanlab-based detection tool directly within the Galaxy platform.

What Can You Do With It?

Automatically identify mislabeled data using Cleanlab functionality.
Improve ML model accuracy by filtering or correcting noisy labels for both classification and regression tasks.

The image above shows the difference in classification performance for several classifiers across several PMLB ML benchmark datasets. Classification improvements of up to 20% are achieved for these datasets.

How It Works

Upload your dataset (with features and labels).
Run the Cleanlab Galaxy tool to:

Get a report identifying potential label errors.
Clean your original raw dataset by removing the erroneous or low-quality samples to obtain a higher-quality dataset.
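
To give a feel for what "identify, then remove" means in practice, here is a minimal, dependency-free sketch of the per-class thresholding idea behind confident learning, the approach Cleanlab is built on. It is a simplified illustration, not the Galaxy tool's implementation and not Cleanlab's API: the helpers `flag_label_issues` and `remove_issues` are hypothetical names, the toy labels and predicted probabilities are made up, and the sketch assumes you already have out-of-sample predicted probabilities (for example from cross-validation).

```python
from collections import defaultdict


def flag_label_issues(labels, pred_probs):
    """Return indices of samples whose given label looks suspicious.

    Hypothetical helper for illustration only (not Cleanlab's API).
    labels: list of int class ids (0..K-1) as given in the dataset.
    pred_probs: one list of K class probabilities per sample, assumed
        to come from out-of-sample predictions.
    """
    num_classes = len(pred_probs[0])

    # Per-class threshold: the mean predicted probability of class k
    # over the samples that are labeled k.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, probs in zip(labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    thresholds = [
        sums[k] / counts[k] if counts[k] else float("inf")
        for k in range(num_classes)
    ]

    issues = []
    for idx, (label, probs) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability reaches that class's threshold.
        confident = [k for k in range(num_classes) if probs[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        likely = max(confident, key=lambda k: probs[k])
        if likely != label:
            issues.append(idx)  # confident prediction disagrees with label
    return issues


def remove_issues(rows, issue_indices):
    """Drop the flagged rows, keeping the original order."""
    drop = set(issue_indices)
    return [row for i, row in enumerate(rows) if i not in drop]


# Toy data (made up): sample 2 is labeled 0 but predicted as class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.25, 0.75],
    [0.10, 0.90],
    [0.30, 0.70],
]

issues = flag_label_issues(labels, pred_probs)
cleaned_labels = remove_issues(labels, issues)
print("Flagged sample indices:", issues)
print("Labels after cleaning:", cleaned_labels)
```

In this toy run only the third sample is flagged, because the model confidently predicts a class other than its given label. The real library goes further than this sketch, for example by estimating how many errors to expect per class and ranking samples by label quality.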

The tool makes it easy to clean up datasets in a reproducible way before training your models — a crucial step in any robust ML pipeline.