***********************************
:py:mod:`wildboar.datasets.outlier`
***********************************

.. py:module:: wildboar.datasets.outlier

.. autoapi-nested-parse::

   Utilities for generating synthetic outlier datasets.

   See the :ref:`User Guide
   <guide-outlier-generation>` for more details and
   example uses.

   ..
       !! processed by numpydoc !!


Module Contents
---------------


Functions
---------

.. autoapisummary::

   wildboar.datasets.outlier.density_outliers
   wildboar.datasets.outlier.emmott_outliers
   wildboar.datasets.outlier.kmeans_outliers
   wildboar.datasets.outlier.majority_outliers
   wildboar.datasets.outlier.minority_outliers


.. py:function:: density_outliers(x, *, n_outliers=0.05, method='dbscan', eps=2.0, min_sample=5, metric='euclidean', max_eps=np.inf, random_state=None)

   
   Density-based outlier generation.

   Labels samples as outliers if a density-based clustering algorithm fails to
   assign them to a cluster.

   :Parameters:

       **x** : ndarray of shape (n_samples, n_timestep)
           The input samples.

       **n_outliers** : float, optional
           The number of outlier samples expressed as a fraction of the inlier samples.
           By default, all samples of the minority class are considered outliers.

       **method** : {"dbscan", "optics"}, optional
           The density-based clustering method.

       **eps** : float, optional
           The eps parameter, when `method="dbscan"`.

       **min_sample** : int, optional
           The `min_sample` parameter to the cluster method.

       **metric** : str, optional
           The `metric` parameter to the cluster method.

       **max_eps** : float, optional
           The `max_eps` parameter, when `method="optics"`.

       **random_state** : int or RandomState
           - If `int`, `random_state` is the seed used by the random number generator.
           - If `RandomState` instance, `random_state` is the random number generator.
           - If `None`, the random number generator is the `RandomState` instance used
             by `np.random`.

   :Returns:

       **x_outlier** : ndarray of shape (n_inliers + n_outliers, n_timestep)
           The samples.

       **y_outlier** : ndarray of shape (n_inliers + n_outliers, )
           The inliers (labeled as 1) and outliers (labeled as -1).


   ..
       !! processed by numpydoc !!
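
The labeling rule used by ``density_outliers`` can be illustrated with a small NumPy sketch. It does not call the library: the helper ``dbscan_noise_mask`` is hypothetical and only mimics how DBSCAN marks noise (a sample that is neither a core sample nor within ``eps`` of one), and it ignores the ``n_outliers`` subsampling step.

```python
import numpy as np


def dbscan_noise_mask(x, eps=2.0, min_samples=5):
    """Return True for samples that DBSCAN would leave unclustered.

    Hypothetical helper for illustration; not part of wildboar.
    """
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    neighbors = dist <= eps
    # A core sample has at least `min_samples` neighbors (itself included).
    core = neighbors.sum(axis=1) >= min_samples
    # A sample belongs to a cluster if some core sample lies within `eps`.
    reachable = (neighbors & core[None, :]).any(axis=1)
    return ~reachable


rng = np.random.RandomState(0)
dense = rng.normal(0.0, 0.1, size=(30, 4))    # one tight cluster
far = np.array([[10.0] * 4, [15.0] * 4])      # two isolated samples
x = np.vstack([dense, far])

noise = dbscan_noise_mask(x, eps=2.0, min_samples=5)
# Same label convention as the return value: 1 for inliers, -1 for outliers.
y_outlier = np.where(noise, -1, 1)
```

Only the two isolated samples fail to join a cluster, so only they receive the label -1.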

.. py:function:: emmott_outliers(x, y, *, n_outliers=None, confusion_estimator=None, difficulty_estimator=None, transform='interval', difficulty='simplest', scale=None, variation='tight', random_state=None)

   
   Difficulty-based outlier generation.

   Create a synthetic outlier detection dataset from a labeled classification
   dataset using the method described by Emmott et al. (2013).

   The Emmott labeler can reliably label both binary and multiclass datasets. For
   binary datasets, a random label is selected as the outlier class. For multiclass
   datasets, a set of classes with maximal confusion (as measured by
   `confusion_estimator`) is selected as the outlier label. For each outlier sample,
   the `difficulty_estimator` assigns a difficulty score, which is digitized into
   ranges and selected according to the `difficulty` parameter. Finally, a sample of
   approximately `n_outliers` is selected, either maximally dispersed or tight.

   :Parameters:

       **x** : ndarray of shape (n_samples, n_timestep)
           The input samples.

       **y** : ndarray of shape (n_samples, )
           The input labels.

       **n_outliers** : float, optional
           The number of outlier samples expressed as a fraction of the inlier samples.
           
           - if float, the number of outliers is guaranteed, but an error is raised
             if the requested difficulty has too few samples or the labels selected
             for the outlier label have too few samples.

       **confusion_estimator** : object, optional
           Estimator of class confusion for datasets where `n_classes > 2`. Defaults to
           a random forest classifier.

       **difficulty_estimator** : object, optional
           Estimator for sample difficulty. The difficulty estimator must support
           `predict_proba`. Defaults to a kernel logistic regression model with
           an RBF kernel.

       **transform** : 'interval' or Transform, optional
           Transform `x` before it is passed to the confusion and difficulty estimators.
           
           - if None, no transformation is applied.
           - if 'interval', use the :class:`transform.IntervalTransform` with default
             parameters.
           - otherwise, use the supplied transform.

       **difficulty** : {'any', 'simplest', 'hardest'}, int or array-like, optional
           The difficulty of the outlier points, quantized according to `scale`. The
           value should be in the range `[1, len(scale)]`, with lower difficulty
           denoting simpler outliers. If an array is given, multiple difficulties can
           be included, e.g., `[1, 4]` would mix easy and difficult outliers.

           - if 'any', outliers are sampled from all scores.
           - if 'simplest', the simplest `n_outliers` are selected.
           - if 'hardest', the hardest `n_outliers` are selected.

       **scale** : int or array-like, optional
           The scale of quantized difficulty scores. Defaults to `[0, 0.16, 0.3, 0.5]`.
           Scores (which are probabilities in the range [0, 1]) are fit into the ranges
           using `np.digitize(difficulty, scale)`.
           
           - if int, use `scale` percentiles based on the difficulty scores.

       **variation** : {'tight', 'dispersed'}, optional
           Selection procedure for sampling outlier samples. If `difficulty="simplest"`
           or `difficulty="hardest"`, this parameter has no effect.
           
           - if 'tight', a pivot point is selected and the `n_outliers` closest samples
             are selected according to their Euclidean distance.
           - if 'dispersed', `n_outliers` points are selected according to a facility
             location algorithm such that they are distributed among the outliers.

       **random_state** : int or RandomState
           - If `int`, `random_state` is the seed used by the random number generator.
           - If `RandomState` instance, `random_state` is the random number generator.
           - If `None`, the random number generator is the `RandomState` instance used
             by `np.random`.

   :Returns:

       **x_outlier** : ndarray of shape (n_inliers + n_outliers, n_timestep)
           The samples.

       **y_outlier** : ndarray of shape (n_inliers + n_outliers, )
           The inliers (labeled as 1) and outliers (labeled as -1).


   .. warning::

       n_outliers
           The number of outliers returned depends on the difficulty setting and the
           number of available samples of the minority class. If the minority class does
           not contain a sufficient number of samples of the desired difficulty, fewer
           than `n_outliers` may be returned.


   .. rubric:: Notes

   - For multiclass datasets, the Emmott labeler requires the package `networkx`.
   - The `difficulty` values 'simplest' and 'hardest' are not described by
     Emmott et al. (2013).

   .. rubric:: References

   Emmott, A. F., Das, S., Dietterich, T., Fern, A., & Wong, W. K. (2013).
       Systematic construction of anomaly detection benchmarks from real data.
       In Proceedings of the ACM SIGKDD workshop on outlier detection and description
       (pp. 16-21).

   .. only:: latex

      
   ..
       !! processed by numpydoc !!
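
The interaction between ``scale`` and ``difficulty`` in ``emmott_outliers`` can be made concrete with ``np.digitize``, which the parameter description names. The following is a minimal sketch using made-up difficulty scores; the variable names are illustrative, and the last step only mirrors what the documentation says about ``difficulty=[1, 4]``, not the library's internal code.

```python
import numpy as np

# Default scale from the parameter description.
scale = [0, 0.16, 0.3, 0.5]

# Made-up difficulty scores (probabilities in [0, 1]) for four samples.
scores = np.array([0.05, 0.2, 0.4, 0.9])

# Each score is mapped to the index of the range it falls into,
# giving levels in [1, len(scale)].
levels = np.digitize(scores, scale)
print(levels)  # [1 2 3 4]

# Keeping levels 1 and 4 mixes the easiest and the most difficult samples.
selected = np.isin(levels, [1, 4])
print(selected)  # [ True False False  True]
```

A score of 0.9 lies above the last edge of the scale and therefore receives the highest level, ``len(scale)``.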

.. py:function:: kmeans_outliers(x, *, n_outliers=0.05, n_clusters=5, random_state=None)

   
   K-means outlier generation.

   Label the samples of the cluster farthest from the other clusters as
   outliers.

   :Parameters:

       **x** : ndarray of shape (n_samples, n_timestep)
           The input samples.

       **n_outliers** : float, optional
           The number of outlier samples expressed as a fraction of the inlier samples.
           
           - if float, the number of outliers is guaranteed, but an error is raised
             if no cluster can satisfy the constraints. Lower the `n_clusters`
             parameter to allow for more samples per cluster.

       **n_clusters** : int, optional
           The number of clusters.

       **random_state** : int or RandomState
           - If `int`, `random_state` is the seed used by the random number generator.
           - If `RandomState` instance, `random_state` is the random number generator.
           - If `None`, the random number generator is the `RandomState` instance used
             by `np.random`.

   :Returns:

       **x_outlier** : ndarray of shape (n_inliers + n_outliers, n_timestep)
           The samples.

       **y_outlier** : ndarray of shape (n_inliers + n_outliers, )
           The inliers (labeled as 1) and outliers (labeled as -1).


   ..
       !! processed by numpydoc !!
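
The idea behind ``kmeans_outliers`` can be sketched as follows. The sketch assumes that the cluster centers and per-sample cluster assignments have already been computed by some k-means implementation, and that "farthest" means the largest mean distance to the other centers. Both are assumptions made for illustration; the documentation does not state how the library measures this.

```python
import numpy as np

# Hypothetical k-means result: four centers and a cluster index per sample.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [8.0, 8.0]])
labels = np.array([0, 0, 1, 1, 2, 2, 3])

# Pairwise distances between cluster centers.
dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)

# Mean distance from each center to the other centers (the self-distance is 0).
mean_dist = dist.sum(axis=1) / (len(centers) - 1)

# The most isolated cluster supplies the outliers.
outlier_cluster = int(np.argmax(mean_dist))
y_outlier = np.where(labels == outlier_cluster, -1, 1)
```

Here the cluster centered at ``(8, 8)`` is the most isolated, so its single sample is labeled -1.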

.. py:function:: majority_outliers(x, y, *, n_outliers=0.05, random_state=None)

   
   Label the majority class as inliers.


   :Parameters:

       **x** : ndarray of shape (n_samples, n_timestep)
           The input samples.

       **y** : ndarray of shape (n_samples, )
           The input labels.

       **n_outliers** : float, optional
           The number of outlier samples expressed as a fraction of the inlier samples.

       **random_state** : int or RandomState
           - If `int`, `random_state` is the seed used by the random number generator.
           - If `RandomState` instance, `random_state` is the random number generator.
           - If `None`, the random number generator is the `RandomState` instance used
             by `np.random`.

   :Returns:

       **x_outlier** : ndarray of shape (n_inliers + n_outliers, n_timestep)
           The samples.

       **y_outlier** : ndarray of shape (n_inliers + n_outliers, )
           The inliers (labeled as 1) and outliers (labeled as -1).


   ..
       !! processed by numpydoc !!
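
A minimal NumPy sketch of what ``majority_outliers`` describes: the majority class supplies the inliers, and a random subset of the remaining samples, sized as a fraction of the inliers, supplies the outliers. This is not the library's implementation. In particular, the rounding of the outlier count and the sampling without replacement are assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
y = np.array([0] * 40 + [1] * 6 + [2] * 4)   # class 0 is the majority
x = rng.normal(size=(y.size, 8))             # shape (n_samples, n_timestep)

# Find the majority class and split the sample indices.
classes, counts = np.unique(y, return_counts=True)
majority = classes[np.argmax(counts)]
inlier_idx = np.flatnonzero(y == majority)
candidate_idx = np.flatnonzero(y != majority)

# n_outliers is a fraction of the number of inliers (rounding is an assumption).
n_outliers = 0.05
n_keep = int(round(n_outliers * inlier_idx.size))
outlier_idx = rng.choice(candidate_idx, size=n_keep, replace=False)

# Stack inliers first, then outliers, with labels 1 and -1.
x_outlier = x[np.concatenate([inlier_idx, outlier_idx])]
y_outlier = np.concatenate(
    [np.ones(inlier_idx.size, dtype=int), -np.ones(n_keep, dtype=int)]
)
```

With 40 inliers and ``n_outliers=0.05``, two outliers are kept, so ``x_outlier`` has shape ``(42, 8)``.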

.. py:function:: minority_outliers(x, y, *, n_outliers=0.05, random_state=None)

   
   Label (a fraction of) the minority class as the outlier.


   :Parameters:

       **x** : ndarray of shape (n_samples, n_timestep)
           The input samples.

       **y** : ndarray of shape (n_samples, )
           The input labels.

       **n_outliers** : float, optional
           The number of outlier samples expressed as a fraction of the inlier samples.
           
           - if float, the number of outliers is guaranteed, but an error is raised
             if the minority class has too few samples.

       **random_state** : int or RandomState
           - If `int`, `random_state` is the seed used by the random number generator.
           - If `RandomState` instance, `random_state` is the random number generator.
           - If `None`, the random number generator is the `RandomState` instance used
             by `np.random`.

   :Returns:

       **x_outlier** : ndarray of shape (n_inliers + n_outliers, n_timestep)
           The samples.

       **y_outlier** : ndarray of shape (n_inliers + n_outliers, )
           The inliers (labeled as 1) and outliers (labeled as -1).


   ..
       !! processed by numpydoc !!
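
The constraint on ``n_outliers`` in ``minority_outliers`` can be illustrated with a short sketch for a binary dataset. The helper ``required_outliers`` is hypothetical, and the rounding rule and the exception type are assumptions; the documentation only states that an error is raised when the minority class has too few samples.

```python
import numpy as np


def required_outliers(y, n_outliers=0.05):
    """Return the minority label and the number of outliers to draw.

    Hypothetical helper for illustration; not part of wildboar.
    """
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_minority = int(counts.min())
    # All samples outside the minority class are treated as inliers.
    n_inliers = int(counts.sum()) - n_minority
    n_required = int(round(n_outliers * n_inliers))
    if n_required > n_minority:
        raise ValueError(
            f"need {n_required} outliers, but the minority class "
            f"has only {n_minority} samples"
        )
    return minority, n_required


y = np.array([0] * 95 + [1] * 5)

# 5% of 95 inliers rounds to 5, which the minority class can supply.
minority, n_required = required_outliers(y, n_outliers=0.05)

# 20% of 95 inliers is 19, more than the 5 available minority samples.
try:
    required_outliers(y, n_outliers=0.2)
except ValueError as err:
    message = str(err)
```

Lowering ``n_outliers`` or using a dataset with a larger minority class avoids the error in this sketch.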