wildboar.datasets.outlier#
Module Contents#
Classes#
KernelLogisticRegression – A simple kernel logistic regression implementation using a Nystroem kernel approximation.
Functions#
density_outliers – Label samples as outliers if a density clustering algorithm fails to assign them to a cluster.
emmott_outliers – Create a synthetic outlier detection dataset from a labeled classification dataset.
kmeans_outliers – Label the samples of the cluster farthest from the other clusters as outliers.
majority_outliers – Label the majority class as inliers.
minority_outliers – Label (a fraction of) the minority class as outliers.
- class wildboar.datasets.outlier.KernelLogisticRegression(kernel=None, *, kernel_params=None, n_components=100, penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)[source]#
Bases:
sklearn.linear_model.LogisticRegression

A simple kernel logistic regression implementation using a Nystroem kernel approximation.
See also
wildboar.datasets.outlier.EmmottLabeler – Synthetic outlier dataset construction.
- Parameters:
kernel (str, optional) – The kernel function to use. See sklearn.metrics.pairwise.kernel_metrics for available kernels. The default kernel is 'rbf'.
kernel_params (dict, optional) – Parameters to the kernel function.
n_components (int, optional) – The number of features to construct.
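Conceptually, the class chains scikit-learn's Nystroem kernel approximation with a linear logistic regression. A minimal sketch of that pipeline, built only from scikit-learn components (an illustration of the idea, not wildboar's actual implementation):

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Approximate an RBF kernel with 100 components, then fit a linear
# logistic regression in the approximated feature space.
clf = make_pipeline(
    Nystroem(kernel="rbf", n_components=100, random_state=1),
    LogisticRegression(max_iter=100),
)
clf.fit(X, y)
print(round(clf.score(X, y), 2))
```

Because the kernel is approximated with a fixed number of components, the model stays linear in the induced feature space, so `decision_function` and `predict_proba` behave exactly as in a plain `LogisticRegression`.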
- decision_function(x)[source]#
Predict confidence scores for samples.
The confidence score for a sample is proportional to the signed distance of that sample to the hyperplane.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data matrix for which we want to get the confidence scores.
- Returns:
scores – Confidence scores per (n_samples, n_classes) combination. In the binary case, confidence score for self.classes_[1] where >0 means this class would be predicted.
- Return type:
ndarray of shape (n_samples,) or (n_samples, n_classes)
- fit(x, y, sample_weight=None)[source]#
Fit the model according to the given training data.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target vector relative to X.
sample_weight (array-like of shape (n_samples,), default=None) –
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
New in version 0.17: sample_weight support to LogisticRegression.
- Returns:
Fitted estimator.
- Return type:
self
Notes
The SAGA solver supports both float64 and float32 bit arrays.
- wildboar.datasets.outlier.density_outliers(x, y=None, *, n_outliers=0.05, method='dbscan', eps=2.0, min_sample=5, metric='euclidean', max_eps=np.inf, random_state=None)[source]#
Label samples as outliers if a density clustering algorithm fails to assign them to a cluster.
- Parameters:
x (ndarray of shape (n_samples, n_timestep)) – The input samples
y (ndarray of shape (n_samples, ), optional) – Ignored.
n_outliers (float, optional) – The number of outlier samples expressed as a fraction of the inlier samples.
method ({"dbscan", "optics"}, optional) – The density-based clustering method.
eps (float, optional) – The eps parameter, when method="dbscan".
min_sample (int, optional) – The min_sample parameter to the cluster method.
metric (str, optional) – The metric parameter to the cluster method.
max_eps (float, optional) – The max_eps parameter, when method="optics".
- Returns:
x_outlier (ndarray of shape (n_inliers + n_outliers, n_timestep)) – The samples.
y_outlier (ndarray of shape (n_inliers + n_outliers,)) – The inliers (labeled as 1) and outliers (labeled as -1).
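The underlying idea can be sketched with scikit-learn's DBSCAN, which labels samples it cannot assign to any cluster as noise (-1). This is a simplified illustration of the mechanism, not wildboar's exact implementation:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
# Two dense clusters plus one isolated point that DBSCAN cannot assign.
inliers = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
x = np.vstack([inliers, [[100.0, 100.0]]])

labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(x)
# Samples left unassigned by the clustering get the noise label -1,
# mirroring the outlier label used by density_outliers.
y_outlier = np.where(labels == -1, -1, 1)
print(int(y_outlier[-1]))
```

Note that with `method="optics"` the `max_eps` parameter plays the analogous role to `eps`.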
- wildboar.datasets.outlier.emmott_outliers(x, y, *, n_outliers=None, confusion_estimator=None, difficulty_estimator=None, transform='interval', difficulty='simplest', scale=None, variation='tight', random_state=None)[source]#
Create a synthetic outlier detection dataset from a labeled classification dataset using the method described by Emmott et al. (2013).
The Emmott labeler can reliably label both binary and multiclass datasets. For binary datasets, a random label is selected as the outlier class. For multiclass datasets, a set of classes with maximal confusion (as measured by confusion_estimator) is selected as the outlier label. For each outlier sample, the difficulty_estimator assigns a difficulty score, which is digitized into ranges and selected according to the difficulty parameter. Finally, a sample of approximately n_outliers is selected, either maximally dispersed or tight.
- Parameters:
x (ndarray of shape (n_samples, n_timestep)) – The input samples
y (ndarray of shape (n_samples, )) – The input labels.
n_outliers (float, optional) –
The number of outlier samples expressed as a fraction of the inlier samples.
If float, the number of outliers is guaranteed, but an error is raised if the requested difficulty has too few samples or the class selected as the outlier label has too few samples.
confusion_estimator (object, optional) – Estimator of class confusion for datasets where n_classes > 2. Defaults to a random forest classifier.
difficulty_estimator (object, optional) – Estimator for sample difficulty. The difficulty estimator must support predict_proba. Defaults to a kernel logistic regression model with an RBF kernel.
transform ('interval' or Transform, optional) –
Transform x before the confusion and difficulty estimators.
If None, no transformation is applied.
If 'interval', use the transform.IntervalTransform with default parameters.
Otherwise, use the supplied transform.
difficulty ({'any', 'simplest', 'hardest'}, int or array-like, optional) –
The difficulty of the outlier points, quantized according to scale. The value should be in the range [1, len(scale)], with lower difficulty denoting simpler outliers. If an array is given, multiple difficulties can be included, e.g., [1, 4] would mix easy and difficult outliers.
If 'any', outliers are sampled from all scores.
If 'simplest', the simplest n_outliers are selected.
If 'hardest', the hardest n_outliers are selected.
scale (int or array-like, optional) –
The scale of quantized difficulty scores. Defaults to [0, 0.16, 0.3, 0.5]. Scores (which are probabilities in the range [0, 1]) are fit into the ranges using np.digitize(difficulty, scale).
If int, use scale percentiles based on the difficulty scores.
variation ({'tight', 'dispersed'}, optional) –
Selection procedure for sampling outlier samples. If difficulty="simplest" or difficulty="hardest", this parameter has no effect.
If 'tight', a pivot point is selected and the n_outliers closest samples are selected according to their Euclidean distance.
If 'dispersed', n_outliers points are selected according to a facility-location algorithm such that they are distributed among the outliers.
random_state (int or RandomState, optional) –
If int, random_state is the seed used by the random number generator.
If RandomState instance, random_state is the random number generator.
If None, the random number generator is the RandomState instance used by np.random.
- Returns:
x_outlier (ndarray of shape (n_inliers + n_outliers, n_timestep)) – The samples.
y_outlier (ndarray of shape (n_inliers + n_outliers,)) – The inliers (labeled as 1) and outliers (labeled as -1).
Notes
For multiclass datasets, the Emmott labeler requires the package networkx.
For dispersed outlier selection, the Emmott labeler requires the package scikit-learn-extra.
The difficulty parameters 'simplest' and 'hardest' are not described by Emmott et al. (2013).
Warning
- n_outliers
The number of outliers returned depends on the difficulty setting and the available number of samples of the minority class. If the minority class does not contain a sufficient number of samples of the desired difficulty, fewer than n_outliers may be returned.
References
- Emmott, A. F., Das, S., Dietterich, T., Fern, A., & Wong, W. K. (2013).
Systematic construction of anomaly detection benchmarks from real data. In Proceedings of the ACM SIGKDD workshop on outlier detection and description (pp. 16-21).
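How the scale parameter quantizes difficulty scores can be illustrated with the np.digitize call named above, using the default scale (standalone NumPy sketch):

```python
import numpy as np

# Difficulty scores are probabilities in [0, 1]; the default scale
# [0, 0.16, 0.3, 0.5] bins them into difficulty levels 1..4.
scale = [0, 0.16, 0.3, 0.5]
scores = np.array([0.05, 0.2, 0.4, 0.9])
levels = np.digitize(scores, scale)
print(levels)  # [1 2 3 4]
```

Passing difficulty=1 would then keep only samples in the first bin (scores below 0.16), i.e., the simplest outliers.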
- wildboar.datasets.outlier.kmeans_outliers(x, y=None, *, n_outliers=0.05, n_clusters=5, random_state=None)[source]#
Label the samples of the cluster farthest from the other clusters as outliers.
- Parameters:
x (ndarray of shape (n_samples, n_timestep)) – The input samples
y (ndarray of shape (n_samples, ), optional) – Ignored.
n_outliers (float, optional) –
The number of outlier samples expressed as a fraction of the inlier samples.
If float, the number of outliers is guaranteed, but an error is raised if no cluster can satisfy the constraints. Lower the n_clusters parameter to allow for more samples per cluster.
n_clusters (int, optional) – The number of clusters.
random_state (int or RandomState, optional) –
If int, random_state is the seed used by the random number generator.
If RandomState instance, random_state is the random number generator.
If None, the random number generator is the RandomState instance used by np.random.
- Returns:
x_outlier (ndarray of shape (n_inliers + n_outliers, n_timestep)) – The samples.
y_outlier (ndarray of shape (n_inliers + n_outliers,)) – The inliers (labeled as 1) and outliers (labeled as -1).
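A simplified NumPy/scikit-learn sketch of the idea (an illustration, not wildboar's exact implementation): cluster with k-means, then flag the cluster whose centroid is, on average, farthest from the other centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Three nearby clusters and one far-away cluster.
x = np.vstack([
    rng.normal(0, 0.5, (30, 2)),
    rng.normal(2, 0.5, (30, 2)),
    rng.normal(4, 0.5, (30, 2)),
    rng.normal(50, 0.5, (10, 2)),
])

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x)
centers = km.cluster_centers_
# Mean distance from each centroid to all centroids; the largest mean
# identifies the most isolated cluster.
dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
farthest = dist.mean(axis=1).argmax()
y_outlier = np.where(km.labels_ == farthest, -1, 1)
print(int((y_outlier == -1).sum()))
```

Here the isolated 10-sample cluster is flagged as the outlier class; with real data, decreasing n_clusters yields larger (and hence more easily satisfiable) candidate clusters.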
- wildboar.datasets.outlier.majority_outliers(x, y, *, n_outliers=0.05, random_state=None)[source]#
Label the majority class as inliers.
- Parameters:
x (ndarray of shape (n_samples, n_timestep)) – The input samples
y (ndarray of shape (n_samples, )) – The input labels.
n_outliers (float, optional) – The number of outlier samples expressed as a fraction of the inlier samples.
random_state (int or RandomState, optional) –
If int, random_state is the seed used by the random number generator.
If RandomState instance, random_state is the random number generator.
If None, the random number generator is the RandomState instance used by np.random.
- Returns:
x_outlier (ndarray of shape (n_inliers + n_outliers, n_timestep)) – The samples.
y_outlier (ndarray of shape (n_inliers + n_outliers,)) – The inliers (labeled as 1) and outliers (labeled as -1).
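A minimal NumPy sketch of the labeling scheme (assumed behavior, not the actual implementation): the most frequent class becomes the inliers, and a fraction of the remaining samples is drawn as outliers:

```python
import numpy as np

rng = np.random.RandomState(0)
y = np.array([0] * 80 + [1] * 20)
x = rng.normal(size=(100, 3))

# The most frequent class becomes the inliers (label 1); a fraction of
# the remaining samples is drawn as outliers (label -1).
classes, counts = np.unique(y, return_counts=True)
majority = classes[counts.argmax()]
inlier_idx = np.flatnonzero(y == majority)
candidate_idx = np.flatnonzero(y != majority)

n_outliers = int(0.05 * inlier_idx.size)
outlier_idx = rng.choice(candidate_idx, size=n_outliers, replace=False)

x_outlier = np.vstack([x[inlier_idx], x[outlier_idx]])
y_outlier = np.concatenate([np.ones(inlier_idx.size), -np.ones(n_outliers)])
print(x_outlier.shape, int((y_outlier == -1).sum()))
```

With 80 majority samples and n_outliers=0.05, four of the twenty non-majority samples end up labeled -1.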
- wildboar.datasets.outlier.minority_outliers(x, y, *, n_outliers=0.05, random_state=None)[source]#
Label (a fraction of) the minority class as outliers.
- Parameters:
x (ndarray of shape (n_samples, n_timestep)) – The input samples
y (ndarray of shape (n_samples, )) – The input labels.
n_outliers (float, optional) –
The number of outlier samples expressed as a fraction of the inlier samples.
If float, the number of outliers is guaranteed, but an error is raised if the minority class has too few samples.
random_state (int or RandomState, optional) –
If int, random_state is the seed used by the random number generator.
If RandomState instance, random_state is the random number generator.
If None, the random number generator is the RandomState instance used by np.random.
- Returns:
x_outlier (ndarray of shape (n_inliers + n_outliers, n_timestep)) – The samples.
y_outlier (ndarray of shape (n_inliers + n_outliers,)) – The inliers (labeled as 1) and outliers (labeled as -1).
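A NumPy sketch of the labeling scheme and its error condition (assumed behavior, not the actual implementation): the least frequent class supplies the outliers, capped at a fraction of the inliers:

```python
import numpy as np

rng = np.random.RandomState(0)
y = np.array([0] * 90 + [1] * 10)

# The least frequent class supplies the outliers; everything else is inlier.
classes, counts = np.unique(y, return_counts=True)
minority = classes[counts.argmin()]
inlier_idx = np.flatnonzero(y != minority)
minority_idx = np.flatnonzero(y == minority)

# Keep n_outliers = 5% of the inliers; fewer minority samples than
# requested raises an error, mirroring the documented behavior.
n_outliers = int(0.05 * inlier_idx.size)
if minority_idx.size < n_outliers:
    raise ValueError("the minority class has too few samples")
outlier_idx = rng.choice(minority_idx, size=n_outliers, replace=False)

y_outlier = np.concatenate([np.ones(inlier_idx.size), -np.ones(n_outliers)])
print(y_outlier.size, int((y_outlier == -1).sum()))
```

With 90 inliers, four of the ten minority samples are kept as outliers; requesting a larger fraction than the minority class can supply is what triggers the error described above.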