CleanVision automatically detects common issues in image datasets, such as images that are (near) duplicates, blurry, or over/under-exposed. This data-centric AI package is designed as a quick first step for any computer vision project: it finds problems in your dataset that you may want to address before applying machine learning.


Install the package with pip:

pip install cleanvision

To install the package with all optional dependencies:

pip install "cleanvision[all]"

How to Use CleanVision

Basic Usage

Here’s how to quickly audit your image data:

from cleanvision import Imagelab

# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")

# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()

# Produce a neat report of the issues found in your dataset
imagelab.report()

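Before running the audit, it can help to confirm that the folder actually contains images. The sketch below is one way to do that. The names `count_images`, `audit_folder`, and `IMAGE_EXTENSIONS` are hypothetical helpers introduced here, not part of CleanVision, and the extension set is an assumption rather than the library's official list of supported formats.

```python
from pathlib import Path

# Assumed set of common image extensions; not CleanVision's official list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def count_images(folder):
    """Count files under `folder` (recursively) with an image-like extension."""
    return sum(
        1
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def audit_folder(folder):
    """Run the default CleanVision audit on `folder` and return the Imagelab."""
    if count_images(folder) == 0:
        raise ValueError(f"No image files found in {folder!r}")

    # Imported lazily so the pre-check above works without cleanvision installed.
    from cleanvision import Imagelab

    imagelab = Imagelab(data_path=str(folder))
    imagelab.find_issues()
    imagelab.report()
    return imagelab
```

Calling `audit_folder("FOLDER_WITH_IMAGES/")` runs the same steps as the basic usage snippet, but fails early with a clear error when the path is wrong or the folder is empty.
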
Targeted Issue Detection

You can also focus on specific issues:

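issue_types = {"light": {}, "blurry": {}}

# Check only for the specified issue_types
imagelab.find_issues(issue_types=issue_types)

# Produce a report with only the specified issue_types
imagelab.report(issue_types=list(issue_types.keys()))
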
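The `issue_types` dictionary maps each issue name to a dictionary of settings, with an empty dictionary leaving that check at its defaults. When the list of checks comes from a config file or command-line flag, a small builder keeps the structure consistent. This is a minimal sketch: `build_issue_types` is a hypothetical helper, not a CleanVision function, and the `threshold` setting in the usage note is an assumption about what an issue type accepts, so check the CleanVision documentation before relying on it.

```python
def build_issue_types(names, overrides=None):
    """Build an issue_types dict: one entry per name, with optional settings.

    `overrides` maps an issue name to a dict of settings for that issue.
    Overrides for names that were not selected are rejected to catch typos.
    """
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(names))
    if unknown:
        raise ValueError(f"Overrides given for unselected issue types: {unknown}")
    # Copy each settings dict so later mutation does not affect the caller's data.
    return {name: dict(overrides.get(name, {})) for name in names}


# Same structure as the hand-written dictionary in this section.
issue_types = build_issue_types(["light", "blurry"])
```

For example, `build_issue_types(["light", "blurry"], {"blurry": {"threshold": 0.2}})` would produce `{"light": {}, "blurry": {"threshold": 0.2}}`, which can then be passed to `find_issues`.
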
Integration with Hugging Face Dataset

Easily use CleanVision with a Hugging Face dataset:

from cleanvision import Imagelab
from datasets import load_dataset, concatenate_datasets

# Download and concatenate different splits
dataset_dict = load_dataset("cifar10")
dataset = concatenate_datasets(list(dataset_dict.values()))

# Pass the key of the Image feature in dataset.features as the `image_key` argument
imagelab = Imagelab(hf_dataset=dataset, image_key="img")
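If you are unsure which key holds the images, you can inspect `dataset.features` instead of guessing. The sketch below is a duck-typed approach that matches features by class name so it runs without importing `datasets`. `find_image_keys` is a hypothetical helper, and matching on the class name `Image` is an assumption about how the Image feature type is named.

```python
def find_image_keys(features):
    """Return the keys of `features` whose feature object is of a class named "Image".

    `features` is expected to be a mapping such as `dataset.features`.
    """
    return [
        key
        for key, feature in features.items()
        if type(feature).__name__ == "Image"
    ]
```

For the CIFAR-10 dataset loaded in this section, `find_image_keys(dataset.features)` should contain `"img"`, matching the `image_key` passed to `Imagelab`.
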


Integration with Torchvision Dataset

CleanVision works smoothly with Torchvision datasets too:

from cleanvision import Imagelab
from torch.utils.data import ConcatDataset
from torchvision.datasets import CIFAR10

# Download and concatenate train set and test set
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)
dataset = ConcatDataset([train_set, test_set])

imagelab = Imagelab(torchvision_dataset=dataset)
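Because the train and test sets are concatenated, any integer index into `dataset` refers to the combined dataset rather than to a single split. Assuming issues are reported by such an index, the sketch below maps one back to its original split. It mirrors the cumulative-size lookup that `ConcatDataset` performs internally. `locate_in_splits` is a hypothetical helper, not part of CleanVision or PyTorch.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_in_splits(index, sizes):
    """Map an index into a concatenated dataset to (split_number, local_index).

    `sizes` lists the length of each concatenated dataset, in order.
    """
    cumulative = list(accumulate(sizes))
    if not cumulative or not 0 <= index < cumulative[-1]:
        raise IndexError(f"index {index} is out of range for sizes {list(sizes)}")
    # The first cumulative size strictly greater than `index` identifies the split.
    split = bisect_right(cumulative, index)
    offset = cumulative[split - 1] if split > 0 else 0
    return split, index - offset


# CIFAR-10 has 50,000 training images and 10,000 test images.
split, local_index = locate_in_splits(50000, [50000, 10000])
```

Here `split` is 1 and `local_index` is 0, meaning index 50000 of the concatenated dataset is the first image of the test set. In practice you could pass `[len(train_set), len(test_set)]` as `sizes`.
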


Additional Resources