Finding and Fixing Label Errors in Classification Datasets

Introduction

If you’ve ever used datasets like CIFAR, MNIST, ImageNet, or IMDB, you likely assumed the class labels are correct. Supervised ML assumes that the labels we train our models on are correct, but recent studies have discovered that even highly curated ML benchmark datasets are full of label errors, and the labels in datasets from real-world applications can be of far lower quality still. Several factors lead to these errors, such as mistakes made by human annotators while labeling the examples. These days, it is increasingly the training data, not the models or infrastructure, that decides whether machine learning will be a success or failure. Training our ML models to predict fundamentally flawed labels is clearly problematic. Even worse, we might train and evaluate these models on flawed labels and then deploy the resulting models at scale.

The Deep Lake community has uploaded a variety of popular machine learning datasets, including CIFAR-10, MNIST, Fashion-MNIST, and ImageNet. These datasets can be accessed and streamed with Deep Lake in one line of code, in seconds, so you can explore them and train models without ever downloading them, regardless of their size. However, most of these datasets contain label errors. This becomes especially problematic when the errors reach the test sets, the subsets of a dataset used to validate the trained model. For example, label errors comprise at least 6% of the ImageNet test set. What can we do about this?
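As a quick illustration, here is roughly what that one-line access looks like. This is a minimal sketch assuming the deeplake Python package is installed; the dataset path and tensor names below refer to the publicly hosted Fashion-MNIST copy and may differ for other datasets.

```python
import deeplake

# Stream the dataset from the public hub without downloading it first.
# The path is an assumption; browse the Deep Lake hub for the exact name.
ds = deeplake.load("hub://activeloop/fashion-mnist-train")

# Inspect the available tensors (e.g., images and labels) and a single sample.
print(ds.tensors.keys())
print(ds.labels[0].numpy())
```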

In this post, we will touch on some of the reasons why labeling errors happen, why they matter, and what tools and techniques can be used to overcome them. Then, we will dive into confident learning, an algorithm that helps discover label issues in any ML dataset. Here at Deep Lake, we’ve run a series of experiments with the algorithm to demonstrate how label noise in datasets can impact the downstream ML model. We’ll then take a look at how we can use Cleanlab, a confident learning implementation, to easily find noise in Deep Lake datasets.

Reasons for Labeling Errors

Today, most practical machine learning models rely on supervised learning. For supervised learning to work, you need a labeled set of data from which the model can learn to make correct decisions. Data labeling typically starts by asking humans to make judgments about a given piece of unlabeled data. For example, labelers may be asked to tag all the images in a dataset that contain a car. In practice, labelers usually have some knowledge of the context of the data.

Why are there errors in the labels in the first place? One broad class of errors is software bugs. These can be controlled and managed with good engineering practices, such as tests that cover both the software and the data. The other class consists of mistakes made by the labelers themselves while annotating the examples, and these are usually much harder to track. In machine learning, a properly labeled dataset that serves as the objective standard to train and assess a given model is often called ground truth. The accuracy of your trained model depends on the accuracy of your ground truth, so spending the time and resources to ensure highly accurate data labeling is essential.

What is Cleanlab?

Cleanlab is a library that automatically finds and fixes label errors in any ML dataset. Under the hood, it uses the Confident Learning (CL) algorithm to detect label errors. Confident Learning is about using all the valuable information we have to find noise in a dataset and improve its quality: it takes one input, the labels produced by labelers, and checks it against another input, a model’s predictions, which act as a noise estimator.

CL is a class of learning methods whose focus is to learn well despite some noise in the dataset. This is achieved by accurately and directly characterizing the uncertainty of label noise in the data. The foundational assumption of CL is that label noise is class-conditional: it depends only on the true latent class, not on the data itself. For instance, a leopard is likely to be mistakenly labeled a jaguar regardless of what the particular image looks like; it is the class, not the individual example, that drives the labeler’s mistakes. Another assumption CL makes is that the dataset’s error rate is less than 50%.
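Stated a bit more formally (this is a standard way of writing the class-conditional noise assumption, where the tilde denotes the observed, possibly noisy label and the star denotes the true latent label):

```latex
p(\tilde{y} = i \mid y^{*} = j, \, x) = p(\tilde{y} = i \mid y^{*} = j)
```

In words: once we know an example’s true class, the particular data point x carries no extra information about how it gets mislabeled.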

In CL, a reasonably performant model is used to estimate the errors in the dataset. First, the model’s out-of-sample predicted probabilities are obtained. Then, using class-specific probability thresholds, the confident joint matrix is computed: a count of examples that the model confidently assigns to one class while their given label says another. This matrix is then normalized to estimate the joint distribution of noisy and true labels, which builds the foundation for pruning the dataset, counting errors, and ranking samples.
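In practice, you don’t have to implement any of this yourself. Below is a minimal sketch using Cleanlab’s filtering API, assuming you already have out-of-sample predicted probabilities (e.g., from cross-validation); the .npy file paths are placeholders for illustration.

```python
import numpy as np
from cleanlab.filter import find_label_issues

# labels: given (possibly noisy) integer labels, shape (n_samples,)
# pred_probs: out-of-sample predicted probabilities, shape (n_samples, n_classes),
# e.g., obtained with sklearn's cross_val_predict(..., method="predict_proba").
labels = np.load("labels.npy")          # placeholder paths for illustration
pred_probs = np.load("pred_probs.npy")

# Indices of examples whose given label is likely wrong, ranked by severity.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"Flagged {len(issue_indices)} potential label errors")
```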

Figure: The Confident Learning algorithm.

Label Errors Impact on Model Performance

To measure the impact of label errors, we need a metric to optimize for. Accuracy, the fraction of predictions a model got right, is a standard metric for evaluating supervised models. To quantify the impact of label errors, we then need a way to benchmark a model trained on clean data against a model trained on noisy data and compare their accuracy. The catch is that we don’t know the true ground truth of an existing dataset, since it may already contain errors; we can only be confident about errors we introduce ourselves. So we start from a dataset that we assume has a relatively low label error rate and artificially add noise to the training set by randomly flipping some of the labels, as shown in the sketch below.
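A minimal sketch of how such symmetric label noise can be injected; the function name and parameters are our own illustrative choices, not part of any library.

```python
import numpy as np

def flip_labels(labels: np.ndarray, noise_level: float, num_classes: int,
                seed: int = 0) -> np.ndarray:
    """Randomly reassign a fraction `noise_level` of labels to a different class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n_flip = int(noise_level * len(labels))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in flip_idx:
        # Pick the new label uniformly from the other classes so the label truly changes.
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy
```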

For the experiment, let’s use the Fashion-MNIST dataset and assume that it has a low rate of errors. Then, we increase the noise in 5% increments, comparing the performance of Baseline and Cleanlab at each step. Here, Baseline is a model trained on the noisy data as-is, while Cleanlab is the same model trained on the noisy data after pruning the examples Cleanlab flags as erroneous. For example, with 60,000 samples in the dataset and 10% noise, we randomly flip 6,000 labels; the Baseline is then trained on all 60,000 samples, while the Cleanlab model is trained only on the examples that weren’t classified as erroneous.
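The experiment loop might look roughly like the sketch below. It reuses the flip_labels helper from the earlier sketch; the simple logistic-regression model, the 3-fold cross-validation, and the 0-50% noise range are illustrative assumptions, not the exact setup used in our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# x_train, y_train, x_test, y_test are assumed to be flattened Fashion-MNIST arrays.
for noise_level in np.arange(0.0, 0.55, 0.05):
    y_noisy = flip_labels(y_train, noise_level, num_classes=10)

    # Baseline: train directly on the noisy labels.
    baseline = LogisticRegression(max_iter=1000).fit(x_train, y_noisy)
    baseline_acc = baseline.score(x_test, y_test)

    # Cleanlab: flag likely label errors using out-of-sample probabilities, drop them, retrain.
    pred_probs = cross_val_predict(LogisticRegression(max_iter=1000), x_train, y_noisy,
                                   cv=3, method="predict_proba")
    issues = find_label_issues(labels=y_noisy, pred_probs=pred_probs,
                               return_indices_ranked_by="self_confidence")
    keep = np.setdiff1d(np.arange(len(y_noisy)), issues)
    cleaned = LogisticRegression(max_iter=1000).fit(x_train[keep], y_noisy[keep])

    print(f"noise={noise_level:.0%}  baseline={baseline_acc:.3f}  "
          f"cleanlab={cleaned.score(x_test, y_test):.3f}")
```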

Figure: Accuracy of Baseline vs. Cleanlab over a range of noise levels.

Pruning the noisy samples works well: Cleanlab accurately catches the erroneous labels introduced at each noise level. Accuracy stays around 85% with Cleanlab across the range of noise levels, while for the Baseline, which keeps all samples, noisy labels included, it drops much faster, reaching a low of 65% at the highest noise level.