wildboar.datasets.outlier#

Module Contents#

Classes#

OutlierLabeler

Base class for outlier labelers

KMeansLabeler

KMeans labeler that assigns an outlier label to the most deviating cluster

DensityLabeler

Density based clustering labeler

MajorityLabeler

Labels the majority class as inliers

MinorityLabeler

Labels the minority class as the outlier

EmmottLabeler

Create a synthetic outlier detection dataset from a labeled classification dataset

class wildboar.datasets.outlier.OutlierLabeler#

Base class for outlier labelers

abstract fit(x, y=None)#

Fit the outlier labeler to the given samples

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

abstract transform(x, y=None)#

Transform the labels of (a subset of) the samples in x to inliers and outliers

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

Returns:

  • x_new (array-like of shape (n_samples_new, n_timestep)) – The outlier and inlier samples

  • y_new (array-like of shape (n_samples_new, )) – The labels

fit_transform(x, y=None)#
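
The fit/transform contract above can be illustrated with a minimal sketch. Everything here is a stand-in and not wildboar code: `SketchLabeler` and `FirstLabelIsOutlier` are hypothetical names, `fit_transform` is assumed to fit and then transform the same data, and the `-1` (outlier) / `1` (inlier) label convention is an assumption.

```python
from abc import ABC, abstractmethod


class SketchLabeler(ABC):
    """Hypothetical stand-in mirroring the fit/transform contract."""

    @abstractmethod
    def fit(self, x, y=None):
        """Learn which samples or classes count as outliers."""

    @abstractmethod
    def transform(self, x, y=None):
        """Return (x_new, y_new) for (a subset of) the samples."""

    def fit_transform(self, x, y=None):
        # Assumed behaviour: fit on the data, then transform the same data.
        return self.fit(x, y).transform(x, y)


class FirstLabelIsOutlier(SketchLabeler):
    """Toy subclass: treats the first label seen as the outlier class."""

    def fit(self, x, y=None):
        self.outlier_label_ = y[0]
        return self

    def transform(self, x, y=None):
        # Assumed convention: -1 marks outliers, 1 marks inliers.
        y_new = [-1 if label == self.outlier_label_ else 1 for label in y]
        return list(x), y_new


x = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
y = ["a", "b", "b"]
x_new, y_new = FirstLabelIsOutlier().fit_transform(x, y)
```

A concrete labeler only has to decide, in `fit`, what counts as an outlier, and apply that decision in `transform`.
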
class wildboar.datasets.outlier.KMeansLabeler(*, n_clusters=None, n_outliers=None, random_state=None)#

Bases: OutlierLabeler

KMeans labeler that assigns an outlier label to the most deviating cluster

k_means_#

The estimator for assigning points to the outlier class

Type:

object

outlier_cluster_#

The cluster index that is considered as outlier

Type:

int

Warning

The implementation does not yet work as expected.

fit(x, y=None)#

Fit the outlier labeler to the given samples

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

fit_transform(x, y=None)#
transform(x, y=None)#

Transform the labels of (a subset of) the samples in x to inliers and outliers

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

Returns:

  • x_new (array-like of shape (n_samples_new, n_timestep)) – The outlier and inlier samples

  • y_new (array-like of shape (n_samples_new, )) – The labels

class wildboar.datasets.outlier.DensityLabeler(*, estimator=None, estimator_params=None)#

Bases: OutlierLabeler

Density based clustering labeler

Labels samples as outliers if a density-based clustering algorithm fails to assign them to a cluster

fit(x, y=None)#

Fit the outlier labeler to the given samples

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

fit_transform(x, y=None)#
transform(x, y=None)#

Transform the labels of (a subset of) the samples in x to inliers and outliers

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

Returns:

  • x_new (array-like of shape (n_samples_new, n_timestep)) – The outlier and inlier samples

  • y_new (array-like of shape (n_samples_new, )) – The labels
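
The idea of treating unclustered samples as outliers can be sketched with the DBSCAN notion of noise. This is a pure-Python illustration, not the wildboar implementation: the function name, the `eps` and `min_samples` parameters, the Euclidean distance, and the `-1`/`1` label convention are all assumptions.

```python
import math


def density_noise_labels(x, eps, min_samples):
    """Label DBSCAN-style noise points as -1 and all other points as 1."""
    n = len(x)
    # Neighbourhoods include the point itself, as in DBSCAN.
    neighbours = [
        [j for j in range(n) if math.dist(x[i], x[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_samples for nb in neighbours]
    labels = []
    for i in range(n):
        # A point joins a cluster if it is a core point or lies within
        # eps of one (a border point); otherwise it is noise.
        clustered = core[i] or any(core[j] for j in neighbours[i])
        labels.append(1 if clustered else -1)
    return labels


x = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
labels = density_noise_labels(x, eps=0.5, min_samples=3)
```

The three nearby points form a dense region, while the isolated fourth point is left unassigned and therefore labeled as an outlier.
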

class wildboar.datasets.outlier.MajorityLabeler(n_outliers=None, random_state=None)#

Bases: OutlierLabeler

Labels the majority class as inliers

outlier_labels_#

The outlier labels

Type:

ndarray

fit(x, y=None)#

Fit the outlier labeler to the given samples

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

transform(x, y=None)#

Transform the labels of (a subset of) the samples in x to inliers and outliers

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

Returns:

  • x_new (array-like of shape (n_samples_new, n_timestep)) – The outlier and inlier samples

  • y_new (array-like of shape (n_samples_new, )) – The labels
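
Majority labeling can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `majority_relabel` is a hypothetical helper, `n_outliers` is treated as an integer cap on the number of outliers kept, `seed` stands in for `random_state`, and `-1`/`1` are assumed outlier/inlier labels.

```python
import random
from collections import Counter


def majority_relabel(x, y, n_outliers=None, seed=0):
    """Keep the majority class as inliers; other classes become outliers."""
    majority = Counter(y).most_common(1)[0][0]
    inliers = [i for i, label in enumerate(y) if label == majority]
    outliers = [i for i, label in enumerate(y) if label != majority]
    # Optionally subsample the outliers (assumed meaning of n_outliers).
    if n_outliers is not None and n_outliers < len(outliers):
        outliers = random.Random(seed).sample(outliers, n_outliers)
    outlier_set = set(outliers)
    keep = sorted(inliers + outliers)
    x_new = [x[i] for i in keep]
    y_new = [-1 if i in outlier_set else 1 for i in keep]
    return x_new, y_new


x = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 0, 1, 2]
x_all, y_all = majority_relabel(x, y)
x_sub, y_sub = majority_relabel(x, y, n_outliers=1)
```

Subsampling is why `x_new` may contain fewer samples than `x`, which matches the `n_samples_new` shape in the return description.
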

class wildboar.datasets.outlier.MinorityLabeler(n_outliers=None, random_state=None)#

Bases: OutlierLabeler

Labels the minority class as the outlier

outlier_label_#

The label of the outlier class

Type:

object

fit(x, y=None)#

Fit the outlier labeler to the given samples

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

transform(x, y=None)#

Transform the labels of (a subset of) the samples in x to inliers and outliers

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

Returns:

  • x_new (array-like of shape (n_samples_new, n_timestep)) – The outlier and inlier samples

  • y_new (array-like of shape (n_samples_new, )) – The labels
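
A minimal sketch of minority labeling, split into `fit` and `transform` to show how an attribute such as `outlier_label_` could be learned and then applied. `MinoritySketch` is a hypothetical class, it ignores `n_outliers` and `random_state`, and the `-1`/`1` label convention is an assumption.

```python
from collections import Counter


class MinoritySketch:
    """Toy labeler: the least frequent class becomes the outlier class."""

    def fit(self, x, y):
        counts = Counter(y)
        # The class with the fewest samples is taken as the outlier label.
        self.outlier_label_ = min(counts, key=counts.get)
        return self

    def transform(self, x, y):
        # All other classes are merged into a single inlier class.
        y_new = [-1 if label == self.outlier_label_ else 1 for label in y]
        return list(x), y_new


x = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = ["a", "a", "b", "a", "c", "c"]
labeler = MinoritySketch().fit(x, y)
x_new, y_new = labeler.transform(x, y)
```
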

class wildboar.datasets.outlier.EmmottLabeler(n_outliers=None, *, confusion_estimator=None, difficulty_estimator=None, difficulty='simplest', scale=None, variation='tight', random_state=None)#

Bases: OutlierLabeler

Create a synthetic outlier detection dataset from a labeled classification dataset using a method described by Emmott et al. (2013).

The Emmott labeler can reliably label both binary and multiclass datasets. For binary datasets, a random label is selected as the outlier class. For multiclass datasets, a set of classes with maximal confusion (as measured by confusion_estimator) is selected as the outlier label. For each outlier sample, the difficulty_estimator assigns a difficulty score, which is digitized into ranges; samples are then selected according to the difficulty parameter. Finally, a sample of approximately n_outliers is selected, either maximally dispersed or tight.
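
The digitize-then-select step can be sketched as below. This is only an illustration of the idea: the bin edges, the function names, and the selection rule (first `n_outliers` candidates in the requested range) are hypothetical and are not taken from wildboar or from the paper.

```python
from bisect import bisect_right

# Hypothetical bin edges over a difficulty score in [0, 1].
EDGES = [0.25, 0.5, 0.75]


def digitize(scores, edges=EDGES):
    """Map each score to the index of the range it falls in."""
    return [bisect_right(edges, s) for s in scores]


def select_by_difficulty(scores, wanted_bin, n_outliers):
    """Return indices of up to n_outliers samples in the wanted range."""
    bins = digitize(scores)
    candidates = [i for i, b in enumerate(bins) if b == wanted_bin]
    # Fewer than n_outliers come back if the range is sparsely populated.
    return candidates[:n_outliers]


scores = [0.05, 0.30, 0.45, 0.80, 0.60]
bins = digitize(scores)
chosen = select_by_difficulty(scores, wanted_bin=1, n_outliers=3)
```

Here only two samples fall in the requested range, so two indices are returned even though three were requested.
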

outlier_label_#

The class or collection of classes used as outliers

Type:

object

difficulty_estimator_#

The estimator used to assess the difficulty of outlier samples

Type:

object

confusion_estimator_#

The estimator used to assess the class confusion (only if n_classes > 2)

Type:

object

n_classes_#

The number of classes

Type:

int

Notes

  • For multiclass datasets, the Emmott labeler requires the package networkx

  • For dispersed outlier selection, the Emmott labeler requires the package scikit-learn-extra

The difficulty parameters ‘simplest’ and ‘hardest’ are not described by Emmott et al. (2013)

Warning

n_outliers

The number of outliers returned depends on the difficulty setting and on the number of samples available in the minority class. If the minority class does not contain a sufficient number of samples of the desired difficulty, fewer than n_outliers may be returned.

References

Emmott, A. F., Das, S., Dietterich, T., Fern, A., & Wong, W. K. (2013). Systematic construction of anomaly detection benchmarks from real data. In Proceedings of the ACM SIGKDD workshop on outlier detection and description (pp. 16-21).

fit(x, y=None)#

Fit the outlier labeler to the given samples

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

fit_transform(x, y=None)#
transform(x, y=None)#

Transform the labels of (a subset of) the samples in x to inliers and outliers

Parameters:
  • x (array-like of shape (n_samples, n_timestep)) – The time series samples

  • y (array-like of shape (n_samples, ), optional) – The optional original labels

Returns:

  • x_new (array-like of shape (n_samples_new, n_timestep)) – The outlier and inlier samples

  • y_new (array-like of shape (n_samples_new, )) – The labels