wildboar.datasets.outlier

Utilities for generating synthetic outlier datasets.

See the User Guide for more details and example uses.

Module Contents

Functions

density_outliers(x, *[, n_outliers, method, eps, ...])

Density-based outlier generation.

emmott_outliers(x, y, *[, n_outliers, ...])

Difficulty-based outlier generation.

kmeans_outliers(x, *[, n_outliers, n_clusters, ...])

K-means outlier generation.

majority_outliers(x, y, *[, n_outliers, random_state])

Label the majority class as inliers.

minority_outliers(x, y, *[, n_outliers, random_state])

Label (a fraction of) the minority class as the outlier.

wildboar.datasets.outlier.density_outliers(x, *, n_outliers=0.05, method='dbscan', eps=2.0, min_sample=5, metric='euclidean', max_eps=np.inf, random_state=None)

Density-based outlier generation.

Labels samples as outliers if a density-based clustering algorithm fails to assign them to a cluster.

Parameters:
x : ndarray of shape (n_samples, n_timestep)

The input samples.

n_outliers : float, optional

The number of outlier samples expressed as a fraction of the inlier samples. By default, all samples of the minority class are considered outliers.

method : {"dbscan", "optics"}, optional

The density-based clustering method.

eps : float, optional

The eps parameter, when method="dbscan".

min_sample : int, optional

The min_sample parameter to the cluster method.

metric : str, optional

The metric parameter to the cluster method.

max_eps : float, optional

The max_eps parameter, when method="optics".

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Returns:
x_outlier : ndarray of shape (n_inliers + n_outliers, n_timestep)

The samples.

y_outlier : ndarray of shape (n_inliers + n_outliers,)

The inliers (labeled as 1) and outliers (labeled as -1).
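
The labeling rule above can be illustrated with a small NumPy-only sketch. It is not wildboar's implementation: the helper `dbscan_noise_mask` is hypothetical, it hard-codes the Euclidean metric, and it only reproduces the DBSCAN notion of a noise point (a sample that is neither a core point nor within `eps` of one). Subsampling to the requested `n_outliers` fraction is left out.

```python
import numpy as np


def dbscan_noise_mask(x, eps=2.0, min_sample=5):
    """Return a boolean mask that is True for DBSCAN-style noise points.

    Hypothetical helper for illustration; not part of wildboar.
    """
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    within_eps = dist <= eps
    # A core point has at least `min_sample` samples within `eps`,
    # counting itself.
    is_core = within_eps.sum(axis=1) >= min_sample
    # A sample is reachable if it is a core point or lies within `eps`
    # of a core point; every other sample is noise.
    reachable = (within_eps & is_core[None, :]).any(axis=1)
    return ~reachable


rng = np.random.RandomState(0)
# 40 samples close to zero plus two isolated samples, 8 time steps each.
x = np.vstack(
    [
        rng.normal(0.0, 0.1, size=(40, 8)),
        np.full((1, 8), 50.0),
        np.full((1, 8), -50.0),
    ]
)

noise = dbscan_noise_mask(x, eps=2.0, min_sample=5)
y_outlier = np.where(noise, -1, 1)  # inliers are 1, outliers are -1
print(int((y_outlier == -1).sum()))  # 2
```

Only the two isolated samples fail to join a cluster, so they are the only ones labeled -1.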

wildboar.datasets.outlier.emmott_outliers(x, y, *, n_outliers=None, confusion_estimator=None, difficulty_estimator=None, transform='interval', difficulty='simplest', scale=None, variation='tight', random_state=None)

Difficulty-based outlier generation.

Create a synthetic outlier detection dataset from a labeled classification dataset using the method described by Emmott et al. (2013).

The Emmott labeler can reliably label both binary and multiclass datasets. For binary datasets, a random label is selected as the outlier class. For multiclass datasets, a set of classes with maximal confusion (as measured by confusion_estimator) is selected as the outlier label. For each outlier sample, the difficulty_estimator assigns a difficulty score, which is digitized into ranges and selected according to the difficulty parameter. Finally, a sample of approximately n_outliers is selected, either maximally dispersed or tight.

Parameters:
x : ndarray of shape (n_samples, n_timestep)

The input samples.

y : ndarray of shape (n_samples,)

The input labels.

n_outliers : float, optional

The number of outlier samples expressed as a fraction of the inlier samples.

  • if float, the number of outliers is guaranteed, but an error is raised if the requested difficulty has too few samples or the labels selected for the outlier label have too few samples.

confusion_estimator : object, optional

Estimator of class confusion for datasets where n_classes > 2. Defaults to a random forest classifier.

difficulty_estimator : object, optional

Estimator for sample difficulty. The difficulty estimator must support predict_proba. Defaults to a kernel logistic regression model with an RBF kernel.

transform : 'interval' or Transform, optional

Transform x before the confusion and difficulty estimators are applied.

  • if None, no transformation is applied.

  • if 'interval', use the transform.IntervalTransform with default parameters.

  • otherwise, use the supplied transform.

difficulty : {'any', 'simplest', 'hardest'}, int or array-like, optional

The difficulty of the outlier points quantized according to scale. The value should be in the range [1, len(scale)], with lower difficulty denoting simpler outliers. If an array is given, multiple difficulties can be included, e.g., [1, 4] would mix easy and difficult outliers.

  • if 'any', outliers are sampled from all scores.

  • if 'simplest', the simplest n_outliers are selected.

  • if 'hardest', the hardest n_outliers are selected.

scale : int or array-like, optional

The scale of quantized difficulty scores. Defaults to [0, 0.16, 0.3, 0.5]. Scores (which are probabilities in the range [0, 1]) are fit into the ranges using np.digitize(difficulty, scale).

  • if int, use scale percentiles based on the difficulty scores.

variation : {'tight', 'dispersed'}, optional

Selection procedure for sampling outlier samples. If difficulty="simplest" or difficulty="hardest", this parameter has no effect.

  • if 'tight', a pivot point is selected and the n_outliers closest samples are selected according to their Euclidean distance.

  • if 'dispersed', n_outliers points are selected according to a facility location algorithm such that they are distributed among the outliers.

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Returns:
x_outlier : ndarray of shape (n_inliers + n_outliers, n_timestep)

The samples.

y_outlier : ndarray of shape (n_inliers + n_outliers,)

The inliers (labeled as 1) and outliers (labeled as -1).
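
To make the interaction between the scale and difficulty parameters concrete, the following sketch applies np.digitize to the default scale quoted above. The difficulty scores are made-up values for illustration, and the final filtering step is only an assumed reading of how difficulty=[1, 4] would pick levels, not wildboar's actual code.

```python
import numpy as np

# Default scale quoted in the parameter description.
scale = np.array([0.0, 0.16, 0.3, 0.5])

# Hypothetical difficulty scores (probabilities in [0, 1]).
scores = np.array([0.05, 0.2, 0.4, 0.9, 0.16])

# Each score is mapped to a level in the range [1, len(scale)].
levels = np.digitize(scores, scale)
print(levels.tolist())  # [1, 2, 3, 4, 2]

# Keep a mix of easy and hard candidates, as in difficulty=[1, 4].
selected = np.flatnonzero(np.isin(levels, [1, 4]))
print(selected.tolist())  # [0, 3]
```

A score that equals a bin edge, such as 0.16, falls into the higher level because np.digitize treats the intervals as closed on the left by default.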

Warning

n_outliers

The number of outliers returned depends on the difficulty setting and the number of available samples in the minority class. If the minority class does not contain a sufficient number of samples of the desired difficulty, fewer than n_outliers may be returned.
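
A minimal sketch of the 'tight' variation shows why the returned count can fall short. The helper `select_tight` is hypothetical, and choosing the pivot uniformly at random is an assumption; the source only states that a pivot is selected and that the closest samples by Euclidean distance are kept.

```python
import numpy as np


def select_tight(candidates, n_outliers, random_state=None):
    """Return indices of the candidates closest to a randomly chosen pivot.

    Hypothetical helper for illustration; not part of wildboar.
    """
    rng = np.random.RandomState(random_state)
    # Assumption: the pivot is drawn uniformly from the candidates.
    pivot = candidates[rng.randint(len(candidates))]
    dist = np.linalg.norm(candidates - pivot, axis=1)
    # Slicing never yields more indices than there are candidates.
    return np.argsort(dist)[:n_outliers]


rng = np.random.RandomState(1)
candidates = rng.normal(size=(6, 10))  # 6 outlier candidates, 10 time steps

print(len(select_tight(candidates, 4, random_state=0)))   # 4
print(len(select_tight(candidates, 20, random_state=0)))  # 6
```

When the pool of candidates at the desired difficulty is smaller than the request, the selection simply runs out of samples.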

Notes

  • For multiclass datasets, the Emmott labeler requires the package networkx.

  • The difficulty parameters 'simplest' and 'hardest' are not described by Emmott et al. (2013).

References

Emmott, A. F., Das, S., Dietterich, T., Fern, A., & Wong, W. K. (2013). Systematic construction of anomaly detection benchmarks from real data. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description (pp. 16-21).

wildboar.datasets.outlier.kmeans_outliers(x, *, n_outliers=0.05, n_clusters=5, random_state=None)

K-means outlier generation.

Label the samples of the cluster farthest from the other clusters as outliers.

Parameters:
x : ndarray of shape (n_samples, n_timestep)

The input samples.

n_outliers : float, optional

The number of outlier samples expressed as a fraction of the inlier samples.

  • if float, the number of outliers is guaranteed, but an error is raised if no cluster can satisfy the constraints. Lower the n_clusters parameter to allow for more samples per cluster.

n_clusters : int, optional

The number of clusters.

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Returns:
x_outlier : ndarray of shape (n_inliers + n_outliers, n_timestep)

The samples.

y_outlier : ndarray of shape (n_inliers + n_outliers,)

The inliers (labeled as 1) and outliers (labeled as -1).
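
The idea of picking the cluster farthest from the others can be sketched as follows. The clustering result is hard-coded rather than fitted, the helper `farthest_cluster` is hypothetical, and measuring "farthest" as the largest mean centroid-to-centroid distance is an assumption; the source does not say which criterion wildboar uses.

```python
import numpy as np


def farthest_cluster(centroids):
    """Return the index of the centroid with the largest mean distance to the rest.

    Hypothetical helper for illustration; not part of wildboar.
    """
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    # Each row holds one zero (the distance to itself), so divide by k - 1.
    mean_dist = dist.sum(axis=1) / (len(centroids) - 1)
    return int(np.argmax(mean_dist))


# Hypothetical k-means result: three centroids and one cluster index per sample.
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
cluster = np.array([0, 0, 1, 1, 1, 2])

outlier_cluster = farthest_cluster(centroids)
y_outlier = np.where(cluster == outlier_cluster, -1, 1)
print(outlier_cluster)     # 2
print(y_outlier.tolist())  # [1, 1, 1, 1, 1, -1]
```

With more clusters, each cluster holds fewer samples on average, which is why a large n_clusters can leave no cluster big enough for the requested fraction.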

wildboar.datasets.outlier.majority_outliers(x, y, *, n_outliers=0.05, random_state=None)

Label the majority class as inliers.

Parameters:
x : ndarray of shape (n_samples, n_timestep)

The input samples.

y : ndarray of shape (n_samples,)

The input labels.

n_outliers : float, optional

The number of outlier samples expressed as a fraction of the inlier samples.

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Returns:
x_outlier : ndarray of shape (n_inliers + n_outliers, n_timestep)

The samples.

y_outlier : ndarray of shape (n_inliers + n_outliers,)

The inliers (labeled as 1) and outliers (labeled as -1).
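
As a rough illustration of the returned shapes, the sketch below keeps the majority class as inliers and samples outliers from the remaining classes. The helper `majority_sketch` is hypothetical; rounding the outlier count to the nearest integer and capping it at the number of available candidates are assumptions, not documented behavior.

```python
import numpy as np


def majority_sketch(x, y, n_outliers=0.05, random_state=None):
    """Keep the majority class as inliers and sample outliers from the rest.

    Hypothetical helper for illustration; not part of wildboar.
    """
    rng = np.random.RandomState(random_state)
    labels, counts = np.unique(y, return_counts=True)
    majority = labels[np.argmax(counts)]
    inliers = np.flatnonzero(y == majority)
    candidates = np.flatnonzero(y != majority)
    # Assumption: the fraction is rounded to the nearest whole sample.
    n = min(int(round(n_outliers * len(inliers))), len(candidates))
    outliers = rng.choice(candidates, size=n, replace=False)
    idx = np.concatenate([inliers, outliers])
    y_out = np.concatenate(
        [np.ones(len(inliers), dtype=int), -np.ones(n, dtype=int)]
    )
    return x[idx], y_out


x = np.arange(130 * 4, dtype=float).reshape(130, 4)
y = np.array([0] * 100 + [1] * 20 + [2] * 10)

x_outlier, y_outlier = majority_sketch(x, y, n_outliers=0.05, random_state=0)
print(x_outlier.shape)               # (105, 4)
print(int((y_outlier == -1).sum()))  # 5
```

With 100 inliers and n_outliers=0.05, five outliers are drawn, giving n_inliers + n_outliers = 105 rows.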

wildboar.datasets.outlier.minority_outliers(x, y, *, n_outliers=0.05, random_state=None)

Label (a fraction of) the minority class as the outlier.

Parameters:
x : ndarray of shape (n_samples, n_timestep)

The input samples.

y : ndarray of shape (n_samples,)

The input labels.

n_outliers : float, optional

The number of outlier samples expressed as a fraction of the inlier samples.

  • if float, the number of outliers is guaranteed, but an error is raised if the minority class has too few samples.

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Returns:
x_outlier : ndarray of shape (n_inliers + n_outliers, n_timestep)

The samples.

y_outlier : ndarray of shape (n_inliers + n_outliers,)

The inliers (labeled as 1) and outliers (labeled as -1).
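
The error condition mentioned for n_outliers can be sketched like this. The helper `minority_sketch` is hypothetical; treating every non-minority class as inliers, the rounding rule, and the exception type are all assumptions made for the example.

```python
import numpy as np


def minority_sketch(y, n_outliers=0.05, random_state=None):
    """Return sample indices and 1/-1 labels with the minority class as outliers.

    Hypothetical helper for illustration; not part of wildboar.
    """
    rng = np.random.RandomState(random_state)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    # Assumption: all samples outside the minority class are inliers.
    inliers = np.flatnonzero(y != minority)
    candidates = np.flatnonzero(y == minority)
    n = int(round(n_outliers * len(inliers)))  # assumed rounding rule
    if n > len(candidates):
        raise ValueError("the minority class has too few samples")
    outliers = rng.choice(candidates, size=n, replace=False)
    idx = np.concatenate([inliers, outliers])
    return idx, np.where(np.isin(idx, outliers), -1, 1)


y = np.array([0] * 100 + [1] * 8)

idx, y_outlier = minority_sketch(y, n_outliers=0.05, random_state=0)
print(int((y_outlier == -1).sum()))  # 5

try:
    # 20 outliers would be needed, but the minority class has only 8 samples.
    minority_sketch(y, n_outliers=0.2, random_state=0)
except ValueError as exc:
    print(exc)  # the minority class has too few samples
```

Requesting a smaller fraction, or supplying a dataset with a larger minority class, avoids the error in this sketch.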