Documentation#
CleanVision automatically detects various issues in your image data, such as images that are: (near) duplicates, blurry, over/under-exposed, etc. This data-centric AI package is designed as a quick first step for any computer vision project to find problems in your dataset, which you may want to address before applying machine learning.
Installation#
pip install cleanvision
To install the package with all optional dependencies:
pip install "cleanvision[all]"
pip install git+https://github.com/cleanlab/cleanvision.git
To install the package with all optional dependencies:
pip install "git+https://github.com/cleanlab/cleanvision.git#egg=cleanvision[all]"
How to Use CleanVision#
Basic Usage#
Here’s how to quickly audit your image data:
from cleanvision import Imagelab
# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")
# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()
# Produce a neat report of the issues found in your dataset
imagelab.report()
Targeted Issue Detection#
You can also focus on specific issues:
issue_types = {"light": {}, "blurry": {}}
imagelab.find_issues(issue_types)
# Produce a report with only the specified issue_types
imagelab.report(issue_types.keys())
Integration with Hugging Face Dataset#
Easily use CleanVision with a Hugging Face dataset:
from datasets import load_dataset, concatenate_datasets
# Download and concatenate different splits
dataset_dict = load_dataset("cifar10")
dataset = concatenate_datasets([d for d in dataset_dict.values()])
# Specify the key for Image feature in dataset.features in `image_key` argument
imagelab = Imagelab(hf_dataset=dataset, image_key="img")
imagelab.find_issues()
imagelab.report()
Integration with Torchvision Dataset#
CleanVision works smoothly with Torchvision datasets too:
from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset
# Download and concatenate train set and test set
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)
dataset = ConcatDataset([train_set, test_set])
imagelab = Imagelab(torchvision_dataset=dataset)
imagelab.find_issues()
imagelab.report()
Additional Resources#
Get started with Starter Tutorial.
View more code examples that demonstrate how to use CleanVision on various datasets.
Interested in contributing to CleanVision? Check out our Contribution Guide to get started.