CleanVision

Documentation

CleanVision automatically detects various issues in image datasets, such as images that are: (near) duplicates, blurry, over/under-exposed, etc. This data-centric AI package is designed as a quick first step for any computer vision project to find problems in your dataset, which you may want to address before applying machine learning.

Installation

To install the latest stable version (recommended):

$ pip install cleanvision

To install the bleeding-edge developer version:

$ pip install git+https://github.com/cleanlab/cleanvision.git

To install with the optional Hugging Face dependencies:

$ pip install "cleanvision[huggingface]"

To install with the optional Torchvision dependencies:

$ pip install "cleanvision[pytorch]"
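After installing, it can help to confirm which packages are actually present in the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the choice to check `datasets` and `torchvision` is an assumption based on the imports used in the Quickstart examples.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: these are the distributions the Quickstart snippets import.
    for name in ("cleanvision", "datasets", "torchvision"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```

A result of `not installed` for `datasets` or `torchvision` suggests the corresponding optional install command above has not been run in this environment.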

Quickstart

  1. Using CleanVision to audit your image data is as simple as running the code below:

from cleanvision.imagelab import Imagelab

# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")

# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()

# Produce a neat report of the issues found in your dataset
imagelab.report()
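Before pointing `data_path` at a folder, a quick sanity check of what the folder contains can catch a wrong path early. The following is a rough sketch, not part of CleanVision: `count_images` is a hypothetical helper, the extension set is illustrative rather than CleanVision's official list of supported formats, and scanning subfolders recursively is an assumption.

```python
from collections import Counter
from pathlib import Path

# Assumption: an illustrative set of extensions, not CleanVision's official list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"}


def count_images(folder: str) -> Counter:
    """Count files under `folder` (recursively) by lower-cased image extension."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in IMAGE_EXTENSIONS:
            counts[suffix] += 1
    return counts


if __name__ == "__main__":
    counts = count_images("FOLDER_WITH_IMAGES/")
    print(f"{sum(counts.values())} image files found: {dict(counts)}")
```

If the total is zero, the path is probably wrong or the images use an extension outside the illustrative set.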
  2. CleanVision diagnoses many types of issues, but you can also check for only specific issues:

issue_types = {"light": {}, "blurry": {}}

imagelab.find_issues(issue_types)

# Produce a report with only the specified issue_types
imagelab.report(issue_types.keys())
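A misspelled issue name is easy to miss when writing the `issue_types` dictionary by hand. One way to guard against that is to build the dictionary from a list of names and validate them first. This is only a sketch: `build_issue_types` and `KNOWN_ISSUE_TYPES` are hypothetical names, and the set of issue names is an assumption based on CleanVision's documentation at the time of writing, so check it against your installed version.

```python
from typing import Dict, Iterable

# Assumption: issue names as listed in CleanVision's documentation at the time of
# writing; the installed version is the authoritative source.
KNOWN_ISSUE_TYPES = {
    "dark", "light", "blurry", "low_information", "odd_aspect_ratio",
    "odd_size", "grayscale", "exact_duplicates", "near_duplicates",
}


def build_issue_types(names: Iterable[str]) -> Dict[str, dict]:
    """Map each issue name to an empty settings dict, rejecting unknown names."""
    names = list(names)
    unknown = sorted(set(names) - KNOWN_ISSUE_TYPES)
    if unknown:
        raise ValueError(f"Unknown issue type(s): {unknown}")
    return {name: {} for name in names}


issue_types = build_issue_types(["light", "blurry"])
print(issue_types)  # {'light': {}, 'blurry': {}}
```

The resulting dictionary has the same shape as the hand-written one above and can be passed to `imagelab.find_issues(issue_types)` in the same way.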
  3. Run CleanVision on a Hugging Face dataset:

from cleanvision.imagelab import Imagelab
from datasets import load_dataset, concatenate_datasets

# Download and concatenate different splits
dataset_dict = load_dataset("cifar10")
dataset = concatenate_datasets(list(dataset_dict.values()))

# Pass the name of the Image feature (a key in dataset.features) as `image_key`
imagelab = Imagelab(hf_dataset=dataset, image_key="img")

imagelab.find_issues()

imagelab.report()
  4. Run CleanVision on a Torchvision dataset:

from cleanvision.imagelab import Imagelab
from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset

# Download and concatenate the train and test sets
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)
dataset = ConcatDataset([train_set, test_set])


imagelab = Imagelab(torchvision_dataset=dataset)

# Set n_jobs=1 because CleanVision's parallelization may interfere with torch data loaders
imagelab.find_issues(n_jobs=1)

imagelab.report()

More on how to get started with CleanVision: