Imagelab

Imagelab is the core class in CleanVision for finding all types of issues in an image dataset. The methods in this module should suffice for most use-cases, but advanced users can get extra flexibility via the code in other CleanVision modules.

Classes:

Imagelab([data_path, filepaths])

A single class to find all types of issues in image datasets.

class cleanvision.imagelab.Imagelab(data_path=None, filepaths=None)

Bases: object

A single class to find all types of issues in image datasets. Imagelab detects issues in the raw image files themselves and thus can be useful in most computer vision tasks.

Parameters:

data_path (str) – Path to image files. Imagelab will recursively retrieve all image files from the specified path.
filepaths (List[str], optional) – List of image paths on which to run issue checks. Specify only one of data_path or filepaths.

Variables:

issues (pd.DataFrame) –
Dataframe where each row corresponds to an image and columns specify which issues were detected in this image. It has two types of columns for each issue type:

1. <issue_type>_score - This column contains a quality score for each image for a particular type of issue. Scores are between 0 and 1; lower values indicate images exhibiting more severe instances of this issue.
2. is_<issue_type>_issue - This column indicates whether or not the issue_type is detected in each image (a binary decision rather than a numeric score).
issue_summary (pd.DataFrame) – Dataframe where each row corresponds to a type of issue and columns summarize the overall prevalence of this issue in the dataset. Specifically, it shows the number of images detected with the issue.
info (Dict) – Nested dictionary that contains statistics and other useful information about the dataset. Also contains additional information saved while checking for issues in the dataset.
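
The column naming convention described for issues makes it straightforward to pull out flagged images programmatically. The sketch below works on plain dictionaries rather than on the DataFrame itself; `score_column`, `flag_column`, `most_severe`, and the sample rows are hypothetical and only illustrate the convention. With a real Imagelab, rows in this shape could be obtained through pandas via `imagelab.issues.to_dict(orient="index")`, assuming the DataFrame index identifies each image.

```python
from typing import Any, Dict, List, Tuple


def score_column(issue_type: str) -> str:
    """Name of the numeric quality-score column for an issue type."""
    return f"{issue_type}_score"


def flag_column(issue_type: str) -> str:
    """Name of the boolean column marking images with the issue."""
    return f"is_{issue_type}_issue"


def most_severe(
    rows: Dict[str, Dict[str, Any]], issue_type: str, n: int = 4
) -> List[Tuple[str, float]]:
    """Return up to n flagged images, lowest (most severe) score first."""
    flagged = [
        (key, row[score_column(issue_type)])
        for key, row in rows.items()
        if row[flag_column(issue_type)]
    ]
    return sorted(flagged, key=lambda item: item[1])[:n]


# Made-up rows in the shape the documentation describes.
rows = {
    "a.png": {"dark_score": 0.05, "is_dark_issue": True},
    "b.png": {"dark_score": 0.90, "is_dark_issue": False},
    "c.png": {"dark_score": 0.20, "is_dark_issue": True},
}
worst_dark = most_severe(rows, "dark")
```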

Raises:

ValueError – If no images are found in the specified paths, if both data_path and filepaths are given, or if neither is specified.

Examples

Basic usage of Imagelab class

from cleanvision.imagelab import Imagelab
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")
imagelab.find_issues()
imagelab.report()
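
When only a subset of files should be checked, a list of paths can be built first and passed as filepaths instead of data_path. The following is a minimal sketch using only the standard library; `collect_image_paths` and the `IMAGE_EXTENSIONS` tuple are hypothetical names, and the extension set is an assumption rather than the list of formats CleanVision supports.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed set of extensions for this sketch; adjust it to your dataset.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".gif")


def collect_image_paths(
    root: str, extensions: Iterable[str] = IMAGE_EXTENSIONS
) -> List[str]:
    """Recursively list files under root whose suffix is in extensions."""
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in allowed
    )


# The resulting list could then be passed as the filepaths argument:
# imagelab = Imagelab(filepaths=collect_image_paths("FOLDER_WITH_IMAGES/"))
```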

Methods:

`list_default_issue_types`()	Prints list of the issue types detected by default if no types are specified in `Imagelab.find_issues()`
`list_possible_issue_types`()	Prints list of all possible issue types that can be detected in a dataset.
`find_issues`([issue_types, n_jobs])	Finds issues in the dataset.
`report`([issue_types, max_prevalence, ...])	Prints summary of the issues found in your dataset.
`visualize`([image_files, issue_types, ...])	Show specific images.
`get_stats`()	Returns dict of statistics computed from images when auditing the data such as: brightness, color space, aspect ratio, etc.
`save`(path[, force])	Saves this Imagelab instance into a folder at the given path.
`load`(path[, data_path])	Loads Imagelab from given path.

list_default_issue_types()

Prints list of the issue types detected by default if no types are specified in Imagelab.find_issues()

Return type: None

list_possible_issue_types()

Prints list of all possible issue types that can be detected in a dataset. This list will also include custom issue types if you properly add them.

Return type: None

find_issues(issue_types=None, n_jobs=None)

Finds issues in the dataset. If issue_types is not provided, the dataset is checked for a default set of issue types. To see the default set, call Imagelab.list_default_issue_types().

Parameters:

issue_types (Dict[str, Any], optional) – Dict with issue types to check as keys. Each value is a dict of hyperparameters for that issue type; an empty dict uses the default settings.
n_jobs (int, default=None) – Number of parallel workers used by multiprocessing. The default, None, uses the number of cores on your CPU (physical cores if you have the psutil package installed, otherwise logical cores). Set this to 1 to disable parallel processing (if it's causing issues). Windows users may see a speed-up with n_jobs=1.
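
The default described for n_jobs can be approximated with a few lines. This is a sketch of the documented behavior, not CleanVision's own code; `default_n_jobs` is a hypothetical helper, and falling back to 1 when the core count cannot be determined is an assumption.

```python
import os


def default_n_jobs() -> int:
    """Physical core count if psutil is installed, otherwise logical cores."""
    try:
        import psutil  # optional dependency

        cores = psutil.cpu_count(logical=False)
    except ImportError:
        cores = None
    if cores is None:
        # psutil is missing or could not report physical cores.
        cores = os.cpu_count()
    # Assumed fallback: use a single worker if nothing could be detected.
    return cores or 1
```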

Examples

To check for all default issue types use

imagelab.find_issues()

To check for specific issue types with default settings

issue_types = {
    "dark": {},
    "blurry": {}
}
imagelab.find_issues(issue_types)

To check for issue types with different hyperparameters. Different issue types can have different hyperparameters.

issue_types = {
    "dark": {"threshold": 0.1},
    "blurry": {}
}
imagelab.find_issues(issue_types)
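
When the same checks are run repeatedly, the issue_types dict can be assembled from a list of names plus per-type overrides. `build_issue_types` below is a hypothetical convenience helper, not part of CleanVision; the "threshold" hyperparameter is the one used in the example above.

```python
from typing import Any, Dict, Iterable, Optional


def build_issue_types(
    names: Iterable[str],
    overrides: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Map each issue type to its hyperparameters ({} means defaults)."""
    names = list(names)
    overrides = overrides or {}
    unknown = set(overrides) - set(names)
    if unknown:
        raise ValueError(
            f"Overrides given for unselected issue types: {sorted(unknown)}"
        )
    # Copy each override dict so later edits do not affect the caller's data.
    return {name: dict(overrides.get(name, {})) for name in names}


issue_types = build_issue_types(["dark", "blurry"], {"dark": {"threshold": 0.1}})
# issue_types can then be passed to imagelab.find_issues(issue_types)
```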

Return type: None

report(issue_types=None, max_prevalence=None, num_images=None, verbosity=1)

Prints a summary of the issues found in your dataset. By default, this method also displays the images representing the most severe instances of each issue type.

Parameters:

issue_types (List[str], optional) – List of issue types to consider in the report. This must be a subset of the issue types specified in Imagelab.find_issues().
max_prevalence (float, default=0.5) – Value between 0 and 1. Issue types detected in more than this fraction of the images in the dataset are omitted from the report, since you are presumably already aware of them.
num_images (int, default=4) – Maximum number of images to show for each issue type reported. These are examples of the most severe instances of the issue in your dataset.
verbosity (int, {1, 2, 3, 4}) – Increasing verbosity increases the detail of the report. Set this to 1 to report less information, or to 4 to report the most information.
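
The effect of max_prevalence can be illustrated with plain counts. The sketch assumes a mapping from issue type to the number of affected images (the kind of information issue_summary holds) and a known dataset size; `filter_by_prevalence` is a hypothetical helper and not CleanVision's implementation.

```python
from typing import Dict


def filter_by_prevalence(
    num_images_by_issue: Dict[str, int],
    dataset_size: int,
    max_prevalence: float = 0.5,
) -> Dict[str, int]:
    """Keep issue types found in at most max_prevalence of the images."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    if not 0 <= max_prevalence <= 1:
        raise ValueError("max_prevalence must be between 0 and 1")
    return {
        issue: count
        for issue, count in num_images_by_issue.items()
        # Only issues detected in MORE than max_prevalence are dropped.
        if count / dataset_size <= max_prevalence
    }


counts = {"dark": 12, "blurry": 80, "near_duplicates": 50}
reported = filter_by_prevalence(counts, dataset_size=100)
```

Here "blurry" affects 80% of the images and is dropped, while "near_duplicates" sits exactly at the 0.5 limit and is kept, since only issue types above the limit are omitted.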

Examples

Default usage

imagelab.report()

Report specific issue types

issue_types = ["dark", "near_duplicates"]
imagelab.report(issue_types=issue_types)

Return type: None

visualize(image_files=None, issue_types=None, num_images=4, cell_size=(2, 2))

Show specific images.

Can be used for visualizing any of the following:

1. Particular images with paths given in image_files.
2. Images representing the most severe instances of the given issue_types detected in the dataset.
3. Random images from the dataset, if neither image_files nor issue_types is given.

If image_files is given, this overrides the argument issue_types.

Parameters:

image_files (List[str], optional) – List of filepaths for images to visualize.
issue_types (List[str], optional) – List of issue types to visualize. For each issue type, a few images representing the most severe instances of this issue in the dataset are shown.
num_images (int, optional) – Number of images to visualize from the dataset. These images are randomly selected if issue_types is None. If issue_types is given, then this is the number of images to visualize for each issue type (images representing the most severe instances of this issue will be shown). If image_files is given, this argument is ignored.
cell_size (Tuple[int, int], optional) – Dimensions controlling the size of each image in the depicted image grid.
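
The precedence among these arguments can be summarized in a few lines. `visualize_mode` is a hypothetical helper that only mirrors the rules stated above; it does not call CleanVision or display anything.

```python
from typing import List, Optional


def visualize_mode(
    image_files: Optional[List[str]] = None,
    issue_types: Optional[List[str]] = None,
) -> str:
    """Describe which images visualize() would show for these arguments."""
    if image_files is not None:
        # image_files overrides issue_types, and num_images is ignored.
        return "specific images"
    if issue_types is not None:
        return "most severe examples per issue type"
    return "random images"
```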

Examples

To visualize random images from the dataset

imagelab.visualize()

imagelab.visualize(num_images=8)

To visualize specific images from the dataset

image_files = ["./dataset/cat.png", "./dataset/dog.png", "./dataset/mouse.png"]
imagelab.visualize(image_files=image_files)

To visualize top examples of specific issue types from the dataset

issue_types = ["dark", "odd_aspect_ratio"]
imagelab.visualize(issue_types=issue_types)

Return type: None

get_stats()

Returns dict of statistics computed from images when auditing the data such as: brightness, color space, aspect ratio, etc. If statistics have not been computed yet, then returns None.

Return type: Any

save(path, force=False)

Saves this Imagelab instance into a folder at the given path. Your saved Imagelab should be loaded with the same version of the CleanVision package. This method does not save your image files.

Parameters:

path (str) – Path to folder where this Imagelab instance will be saved on disk.
force (bool, default=False) – If set to True, any existing files at path will be overwritten.

Raises:

ValueError – If force is set to False and an existing path is specified for saving the Imagelab instance.

Return type:

None

classmethod load(path, data_path=None)

Loads Imagelab from given path.

Parameters:

path (str) – Path to the saved Imagelab folder previously specified in Imagelab.save() (not the individual pickle file).
data_path (str) – Path to the image dataset previously used in Imagelab. If the data_path has changed, Imagelab will not be loaded, as some of its functionality depends on it. Use the same version of the CleanVision package that was used when saving the Imagelab.

Returns:

Returns a saved instance of Imagelab

Return type:

Imagelab
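
Saving and reloading can be wrapped in one helper. The sketch relies only on the save and load signatures documented above and is duck-typed, so it works with any object exposing them; `save_and_reload` is a hypothetical name. Note that force=True overwrites whatever already exists at path.

```python
from typing import Any, Optional


def save_and_reload(
    imagelab: Any, path: str, data_path: Optional[str] = None
) -> Any:
    """Save an Imagelab-like object to path, then load it back."""
    imagelab.save(path, force=True)  # overwrites existing files at path
    # load is a classmethod, so it is called on the object's class.
    return type(imagelab).load(path, data_path)


# Usage with CleanVision (assumes find_issues() has already been run):
# restored = save_and_reload(imagelab, "./results", data_path="FOLDER_WITH_IMAGES/")
```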