CleanVision automatically detects common issues in image datasets, such as images that are (near) duplicates, blurry, or over/under-exposed. This data-centric AI package is designed as a quick first step for any computer vision project: it finds problems in your dataset that you may want to address before applying machine learning.


To install the latest stable version (recommended):

$ pip install cleanvision

To install the bleeding-edge developer version:

$ pip install git+

To install with Hugging Face optional dependencies:

$ pip install "cleanvision[huggingface]"

To install with Torchvision optional dependencies:

$ pip install "cleanvision[pytorch]"
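To confirm which of the installs above took effect, a quick check from Python can help. The sketch below uses only the standard library. It assumes that the `huggingface` extra provides the `datasets` module and that the `pytorch` extra provides `torchvision`, as the imports in the examples in this document suggest. `installed_version` and `has_module` are hypothetical helper names, not part of CleanVision.

```python
from importlib import metadata, util
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_module(module: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return util.find_spec(module) is not None


if __name__ == "__main__":
    print("cleanvision:", installed_version("cleanvision"))
    # Assumed mapping from optional extras to the modules they provide.
    for extra, module in (("huggingface", "datasets"), ("pytorch", "torchvision")):
        status = "available" if has_module(module) else "missing"
        print(f"cleanvision[{extra}] -> {module}: {status}")
```

A `None` version or a `missing` module means the corresponding `pip install` command has not been run in the active environment.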


  1. Using CleanVision to audit your image data is as simple as running the code below:

from cleanvision.imagelab import Imagelab

# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")

# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()

# Produce a neat report of the issues found in your dataset
imagelab.report()

  1. CleanVision diagnoses many types of issues, but you can also check for only specific issues:

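# Keys are issue names; an empty dict means default settings for that check
issue_types = {"light": {}, "blurry": {}}

imagelab.find_issues(issue_types=issue_types)

# Produce a report with only the specified issue_types
imagelab.report(issue_types=issue_types)
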
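When the set of checks is chosen at runtime, for example from a command-line flag, the `issue_types` dictionary can be built from a list of names. The sketch below assumes only what the snippet above shows: each key is an issue name and an empty dict stands for default settings. `build_issue_types` is a hypothetical helper, not part of CleanVision, and it does not validate the names against the issue types the library supports.

```python
from typing import Dict, Iterable


def build_issue_types(names: Iterable[str]) -> Dict[str, dict]:
    """Map each issue name to its own empty settings dict (hypothetical helper).

    Blank names are skipped, repeated names collapse to one entry,
    and first-seen order is kept.
    """
    issue_types: Dict[str, dict] = {}
    for name in names:
        key = name.strip()
        if key:
            # setdefault keeps the first entry and gives each key a separate dict
            issue_types.setdefault(key, {})
    return issue_types


print(build_issue_types(["light", "blurry", "light"]))
# → {'light': {}, 'blurry': {}}
```

The result is intended to be passed as the `issue_types` argument in the same way as the hand-written dictionary.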
  1. Run CleanVision on a Hugging Face dataset

from datasets import load_dataset, concatenate_datasets

# Download and concatenate different splits
dataset_dict = load_dataset("cifar10")
dataset = concatenate_datasets([d for d in dataset_dict.values()])

# Specify the key for Image feature in dataset.features in `image_key` argument
imagelab = Imagelab(hf_dataset=dataset, image_key="img")

  1. Run CleanVision on a Torchvision dataset

from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset

# Download and concatenate train set and test set
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)
dataset = ConcatDataset([train_set, test_set])

imagelab = Imagelab(torchvision_dataset=dataset)

# We set n_jobs=1 as CleanVision parallelization may interfere with torch data loaders.
imagelab.find_issues(n_jobs=1)

More on how to get started with CleanVision: