wildboar.tree#

Tree-based estimators for classification and regression.

Classes#

ExtraShapeletTreeClassifier

An extra shapelet tree classifier.

ExtraShapeletTreeRegressor

An extra shapelet tree regressor.

IntervalTreeClassifier

An interval-based tree classifier.

IntervalTreeRegressor

An interval-based tree regressor.

PivotTreeClassifier

A tree classifier that uses pivot time series.

ProximityTreeClassifier

A classifier that uses a k-branching tree based on pivot time series.

RocketTreeClassifier

A tree classifier that uses random convolutions as features.

RocketTreeRegressor

A tree regressor that uses random convolutions as features.

ShapeletTreeClassifier

A shapelet tree classifier.

ShapeletTreeRegressor

A shapelet tree regressor.

Functions#

plot_tree(clf, *[, ax, bbox_args, arrow_args, ...])

Plot a tree.
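For example, a fitted tree can be drawn with matplotlib (a minimal sketch, assuming plot_tree renders onto a matplotlib Axes as the ax keyword suggests):

>>> import matplotlib.pyplot as plt
>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier, plot_tree
>>> X, y = load_gun_point()
>>> clf = ShapeletTreeClassifier(random_state=1).fit(X, y)
>>> plot_tree(clf)
>>> plt.show()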


class wildboar.tree.ExtraShapeletTreeClassifier(*, n_shapelets=1, max_depth=None, min_samples_leaf=1, min_impurity_decrease=0.0, min_samples_split=2, min_shapelet_size=0.0, max_shapelet_size=1.0, coverage_probability=None, variability=1, metric='euclidean', metric_params=None, criterion='entropy', class_weight=None, random_state=None)[source]#

An extra shapelet tree classifier.

Extra shapelet trees are constructed by sampling a distance threshold uniformly in the range [min(dist), max(dist)].

Parameters:
n_shapelets : int, optional

The number of shapelets to sample at each node.

max_depth : int, optional

The maximum depth of the tree. If None, the tree is expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_leaf : int, optional

The minimum number of samples in a leaf.

min_impurity_decrease : float, optional

A split will be introduced only if the impurity decrease is larger than or equal to this value.

min_samples_split : int, optional

The minimum number of samples required to split an internal node.

min_shapelet_size : float, optional

The minimum length of a sampled shapelet, expressed as a fraction and computed as max(ceil(X.shape[-1] * min_shapelet_size), 2).

max_shapelet_size : float, optional

The maximum length of a sampled shapelet, expressed as a fraction and computed as ceil(X.shape[-1] * max_shapelet_size).

coverage_probability : float, optional

The probability that a time step is covered by a shapelet, in the range 0 < coverage_probability <= 1.

  • For larger coverage_probability, we get longer shapelets.

  • For smaller coverage_probability, we get shorter shapelets.

variability : float, optional

Controls the shape of the Beta distribution used to sample shapelets. Defaults to 1.

  • Higher variability creates more uniform shapelet sizes.

  • Lower variability creates more variable shapelet sizes.

metric : {"euclidean", "scaled_euclidean", "dtw", "scaled_dtw"}, optional

The distance metric used to identify the best shapelet.

metric_params : dict, optional

Parameters for the distance measure.

criterion : {"entropy", "gini"}, optional

The criterion used to evaluate the utility of a split.

class_weight : dict or "balanced", optional

Weights associated with the labels.

  • If dict, weights of the form {label: weight}.

  • If "balanced", each class weight is inversely proportional to the class frequency.

  • If None, each class has equal weight.

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Attributes:
tree_ : Tree

The tree representation
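Examples

A minimal usage sketch, assuming the GunPoint dataset as in the other examples (the parameter values are illustrative):

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ExtraShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> clf = ExtraShapeletTreeClassifier(random_state=1)
>>> score = clf.fit(X, y).score(X, y)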

apply(x, check_input=True)[source]#

Return the index of the leaf that each sample is predicted by.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
ndarray of shape (n_samples, )

For every sample, return the index of the leaf that the sample ends up in. The index is in the range [0; node_count].

Examples

Get the leaf probability distribution of a prediction:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier()
>>> tree.fit(X, y)
>>> leaves = tree.apply(X)
>>> tree.tree_.value.take(leaves, axis=0)
array([[0., 1.],
       [0., 1.],
       [1., 0.]])

This is equivalent to using tree.predict_proba.

decision_path(x, check_input=True)[source]#

Compute the decision path of the tree.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
sparse matrix of shape (n_samples, n_nodes)

An indicator array where each nonzero value indicates that the sample traverses the corresponding node.
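Examples

Inspect the nodes a sample traverses (a sketch, assuming the GunPoint dataset as in the other examples):

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier().fit(X, y)
>>> paths = tree.decision_path(X)
>>> nodes = paths[0].nonzero()[1]  # node indices visited by the first sample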

fit(x, y, sample_weight=None, check_input=True)[source]#

Fit a classification tree.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The training time series.

y : array-like of shape (n_samples,)

The target values.

sample_weight : array-like of shape (n_samples,), optional

If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

check_input : bool, optional

Allows bypassing several input checks.

Returns:
self

This instance.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(x, check_input=True)[source]#

Predict the class of the input samples x.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples,)

The predicted classes.

predict_proba(x, check_input=True)[source]#

Predict class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same class in a leaf.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples, n_classes)

The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.

score(X, y, sample_weight=None)[source]#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

Mean accuracy of self.predict(X) w.r.t. y.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
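Examples

A minimal sketch (the parameter values are illustrative):

>>> from wildboar.tree import ExtraShapeletTreeClassifier
>>> tree = ExtraShapeletTreeClassifier()
>>> tree = tree.set_params(metric="dtw", max_depth=5)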

class wildboar.tree.ExtraShapeletTreeRegressor(*, n_shapelets=1, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_impurity_decrease=0.0, min_shapelet_size=0.0, max_shapelet_size=1.0, coverage_probability=None, variability=1, metric='euclidean', metric_params=None, criterion='squared_error', random_state=None)[source]#

An extra shapelet tree regressor.

Extra shapelet trees are constructed by sampling a distance threshold uniformly in the range [min(dist), max(dist)].

Parameters:
n_shapelets : int, optional

The number of shapelets to sample at each node.

max_depth : int, optional

The maximum depth of the tree. If None, the tree is expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_split : int, optional

The minimum number of samples required to split an internal node.

min_samples_leaf : int, optional

The minimum number of samples in a leaf.

criterion : {"squared_error"}, optional

The criterion used to evaluate the utility of a split.

Deprecated since version 1.1: Criterion "mse" was deprecated in v1.1 and removed in version 1.2.

min_impurity_decrease : float, optional

A split will be introduced only if the impurity decrease is larger than or equal to this value.

min_shapelet_size : float, optional

The minimum length of a sampled shapelet, expressed as a fraction and computed as max(ceil(X.shape[-1] * min_shapelet_size), 2).

max_shapelet_size : float, optional

The maximum length of a sampled shapelet, expressed as a fraction and computed as ceil(X.shape[-1] * max_shapelet_size).

coverage_probability : float, optional

The probability that a time step is covered by a shapelet, in the range 0 < coverage_probability <= 1.

  • For larger coverage_probability, we get longer shapelets.

  • For smaller coverage_probability, we get shorter shapelets.

variability : float, optional

Controls the shape of the Beta distribution used to sample shapelets. Defaults to 1.

  • Higher variability creates more uniform shapelet sizes.

  • Lower variability creates more variable shapelet sizes.

metric : {"euclidean", "scaled_euclidean", "scaled_dtw"}, optional

The distance metric used to identify the best shapelet.

metric_params : dict, optional

Parameters for the distance measure.

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Attributes:
tree_ : Tree

The internal tree representation

apply(x, check_input=True)[source]#

Return the index of the leaf that each sample is predicted by.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
ndarray of shape (n_samples, )

For every sample, return the index of the leaf that the sample ends up in. The index is in the range [0; node_count].

Examples

Get the leaf probability distribution of a prediction:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier()
>>> tree.fit(X, y)
>>> leaves = tree.apply(X)
>>> tree.tree_.value.take(leaves, axis=0)
array([[0., 1.],
       [0., 1.],
       [1., 0.]])

This is equivalent to using tree.predict_proba.

decision_path(x, check_input=True)[source]#

Compute the decision path of the tree.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
sparse matrix of shape (n_samples, n_nodes)

An indicator array where each nonzero value indicates that the sample traverses the corresponding node.

fit(x, y, sample_weight=None, check_input=True)[source]#

Fit the estimator.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The training time series.

y : array-like of shape (n_samples,)

Target values as floating point values.

sample_weight : array-like of shape (n_samples,), optional

If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

check_input : bool, optional

Allows bypassing several input checks.

Returns:
self

This object.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(x, check_input=True)[source]#

Predict the value of x.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples,)

The predicted values.

score(X, y, sample_weight=None)[source]#

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True values for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

\(R^2\) of self.predict(X) w.r.t. y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score. This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

class wildboar.tree.IntervalTreeClassifier(n_intervals='sqrt', *, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_impurity_decrease=0.0, criterion='entropy', intervals='fixed', sample_size=None, min_size=0.0, max_size=1.0, coverage_probability=None, variability=1, summarizer='mean_var_slope', class_weight=None, random_state=None)[source]#

An interval-based tree classifier.

Parameters:
n_intervals : {"log", "sqrt"}, int or float, optional

The number of intervals to partition the time series into.

  • If "log", the number of intervals is log2(n_timestep).

  • If "sqrt", the number of intervals is sqrt(n_timestep).

  • If int, the number of intervals is n_intervals.

  • If float, the number of intervals is n_intervals * n_timestep, with 0 < n_intervals < 1.

max_depth : int, optional

The maximum depth of the tree. If None, the tree is expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_split : int, optional

The minimum number of samples required to split an internal node.

min_samples_leaf : int, optional

The minimum number of samples in a leaf.

min_impurity_decrease : float, optional

A split will be introduced only if the impurity decrease is larger than or equal to this value.

criterion : {"entropy", "gini"}, optional

The criterion used to evaluate the utility of a split.

intervals : {"fixed", "sample", "random"}, optional

  • If "fixed", n_intervals non-overlapping intervals.

  • If "sample", n_intervals * sample_size non-overlapping intervals.

  • If "random", n_intervals possibly overlapping intervals with sizes randomly sampled in [min_size * n_timestep, max_size * n_timestep].

sample_size : float, optional

The fraction of intervals to sample at each node. Ignored unless intervals="sample".

min_size : float, optional

The minimum interval size if intervals="random". Ignored if coverage_probability is set.

max_size : float, optional

The maximum interval size if intervals="random". Ignored if coverage_probability is set.

coverage_probability : float, optional

The probability that a time step is covered by an interval, in the range 0 < coverage_probability <= 1.

  • For larger coverage_probability, we get longer intervals.

  • For smaller coverage_probability, we get shorter intervals.

variability : float, optional

Controls the shape of the Beta distribution used to sample intervals. Defaults to 1.

  • Higher variability creates more uniform intervals.

  • Lower variability creates more variable interval sizes.

summarizer : str or list, optional

The method used to summarize each interval; a custom summarizer can be given as a list of functions, as shown in the example below.

  • If str, the summarizer is determined by _SUMMARIZERS.keys().

  • If list, the summarizer is a list of functions f(x) -> float, where x is a numpy array.

The default summarizer summarizes each interval as its mean, variance and slope.
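For example, a custom summarizer can be given as a list of plain functions (a sketch; the choice of functions is illustrative):

>>> import numpy as np
>>> from wildboar.tree import IntervalTreeClassifier
>>> summarizer = [np.mean, np.std, lambda x: x.max() - x.min()]
>>> clf = IntervalTreeClassifier(n_intervals="sqrt", summarizer=summarizer)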

class_weight : dict or "balanced", optional

Weights associated with the labels.

  • If dict, weights of the form {label: weight}.

  • If "balanced", each class weight is inversely proportional to the class frequency.

  • If None, each class has equal weight.

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Attributes:
tree_ : Tree

The internal tree structure.

apply(x, check_input=True)[source]#

Return the index of the leaf that each sample is predicted by.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
ndarray of shape (n_samples, )

For every sample, return the index of the leaf that the sample ends up in. The index is in the range [0; node_count].

Examples

Get the leaf probability distribution of a prediction:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier()
>>> tree.fit(X, y)
>>> leaves = tree.apply(X)
>>> tree.tree_.value.take(leaves, axis=0)
array([[0., 1.],
       [0., 1.],
       [1., 0.]])

This is equivalent to using tree.predict_proba.

decision_path(x, check_input=True)[source]#

Compute the decision path of the tree.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
sparse matrix of shape (n_samples, n_nodes)

An indicator array where each nonzero value indicates that the sample traverses the corresponding node.

fit(x, y, sample_weight=None, check_input=True)[source]#

Fit a classification tree.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The training time series.

y : array-like of shape (n_samples,)

The target values.

sample_weight : array-like of shape (n_samples,), optional

If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

check_input : bool, optional

Allows bypassing several input checks.

Returns:
self

This instance.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(x, check_input=True)[source]#

Predict the class of the input samples x.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples,)

The predicted classes.

predict_proba(x, check_input=True)[source]#

Predict class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same class in a leaf.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples, n_classes)

The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.

score(X, y, sample_weight=None)[source]#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

Mean accuracy of self.predict(X) w.r.t. y.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

class wildboar.tree.IntervalTreeRegressor(n_intervals='sqrt', *, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_impurity_decrease=0.0, criterion='squared_error', intervals='fixed', sample_size=None, min_size=0.0, max_size=1.0, coverage_probability=None, variability=1, summarizer='mean_var_slope', random_state=None)[source]#

An interval-based tree regressor.

Parameters:
n_intervals : {"log", "sqrt"}, int or float, optional

The number of intervals to partition the time series into.

  • If "log", the number of intervals is log2(n_timestep).

  • If "sqrt", the number of intervals is sqrt(n_timestep).

  • If int, the number of intervals is n_intervals.

  • If float, the number of intervals is n_intervals * n_timestep, with 0 < n_intervals < 1.

max_depth : int, optional

The maximum depth of the tree. If None, the tree is expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_split : int, optional

The minimum number of samples required to split an internal node.

min_samples_leaf : int, optional

The minimum number of samples in a leaf.

min_impurity_decrease : float, optional

A split will be introduced only if the impurity decrease is larger than or equal to this value.

criterion : {"squared_error"}, optional

The criterion used to evaluate the utility of a split.

intervals : {"fixed", "sample", "random"}, optional

  • If "fixed", n_intervals non-overlapping intervals.

  • If "sample", n_intervals * sample_size non-overlapping intervals.

  • If "random", n_intervals possibly overlapping intervals with sizes randomly sampled in [min_size * n_timestep, max_size * n_timestep].

sample_size : float, optional

The fraction of intervals to sample at each node. Ignored unless intervals="sample".

min_size : float, optional

The minimum interval size if intervals="random". Ignored if coverage_probability is set.

max_size : float, optional

The maximum interval size if intervals="random". Ignored if coverage_probability is set.

coverage_probability : float, optional

The probability that a time step is covered by an interval, in the range 0 < coverage_probability <= 1.

  • For larger coverage_probability, we get longer intervals.

  • For smaller coverage_probability, we get shorter intervals.

variability : float, optional

Controls the shape of the Beta distribution used to sample intervals. Defaults to 1.

  • Higher variability creates more uniform intervals.

  • Lower variability creates more variable interval sizes.

summarizer : str or list, optional

The method used to summarize each interval.

  • If str, the summarizer is determined by _SUMMARIZERS.keys().

  • If list, the summarizer is a list of functions f(x) -> float, where x is a numpy array.

The default summarizer summarizes each interval as its mean, variance and slope.

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Attributes:
tree_ : Tree

The internal tree structure.
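Examples

A minimal regression sketch on synthetic data (the data construction is illustrative, not part of wildboar):

>>> import numpy as np
>>> from wildboar.tree import IntervalTreeRegressor
>>> rng = np.random.RandomState(1)
>>> X = rng.randn(50, 100)        # 50 series with 100 time steps
>>> y = X[:, :10].mean(axis=1)    # target: the mean of the first ten steps
>>> reg = IntervalTreeRegressor(random_state=1).fit(X, y)
>>> reg.predict(X[:2]).shape
(2,)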

apply(x, check_input=True)[source]#

Return the index of the leaf that each sample is predicted by.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
ndarray of shape (n_samples, )

For every sample, return the index of the leaf that the sample ends up in. The index is in the range [0; node_count].

Examples

Get the leaf probability distribution of a prediction:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier()
>>> tree.fit(X, y)
>>> leaves = tree.apply(X)
>>> tree.tree_.value.take(leaves, axis=0)
array([[0., 1.],
       [0., 1.],
       [1., 0.]])

This is equivalent to using tree.predict_proba.

decision_path(x, check_input=True)[source]#

Compute the decision path of the tree.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
sparse matrix of shape (n_samples, n_nodes)

An indicator array where each nonzero value indicates that the sample traverses the corresponding node.

fit(x, y, sample_weight=None, check_input=True)[source]#

Fit the estimator.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The training time series.

y : array-like of shape (n_samples,)

Target values as floating point values.

sample_weight : array-like of shape (n_samples,), optional

If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

check_input : bool, optional

Allows bypassing several input checks.

Returns:
self

This object.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(x, check_input=True)[source]#

Predict the value of x.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples,)

The predicted values.

score(X, y, sample_weight=None)[source]#

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True values for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

\(R^2\) of self.predict(X) w.r.t. y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score. This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

class wildboar.tree.PivotTreeClassifier(n_pivot='sqrt', *, metrics='all', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_impurity_decrease=0.0, impurity_equality_tolerance=None, criterion='entropy', class_weight=None, random_state=None)[source]#

A tree classifier that uses pivot time series.

Parameters:
n_pivot : str or int, optional

The number of pivot time series to sample at each node.

metrics : str, optional

The metrics to sample from. Currently, we only support "all".

max_depth : int, optional

The maximum depth of the tree. If None, the tree is expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_split : int, optional

The minimum number of samples required to split an internal node.

min_samples_leaf : int, optional

The minimum number of samples in a leaf.

min_impurity_decrease : float, optional

A split will be introduced only if the impurity decrease is larger than or equal to this value.

impurity_equality_tolerance : float, optional

The tolerance for considering two impurities as equal. If the impurity decrease is the same, we consider the split that maximizes the gap between the sum of distances.

  • If None, we never consider the separation gap.

Added in version 1.3.

criterion : {"entropy", "gini"}, optional

The criterion used to evaluate the utility of a split.

class_weight : dict or "balanced", optional

Weights associated with the labels.

  • If dict, weights of the form {label: weight}.

  • If "balanced", each class weight is inversely proportional to the class frequency.

  • If None, each class has equal weight.

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Attributes:
tree_ : Tree

The internal tree representation
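Examples

A minimal usage sketch, assuming the GunPoint dataset as in the other examples (the parameter values are illustrative):

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import PivotTreeClassifier
>>> X, y = load_gun_point()
>>> clf = PivotTreeClassifier(n_pivot="sqrt", random_state=1)
>>> clf.fit(X, y)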

apply(x, check_input=True)[source]#

Return the index of the leaf that each sample is predicted by.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
ndarray of shape (n_samples, )

For every sample, return the index of the leaf that the sample ends up in. The index is in the range [0; node_count].

Examples

Get the leaf probability distribution of a prediction:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier()
>>> tree.fit(X, y)
>>> leaves = tree.apply(X)
>>> tree.tree_.value.take(leaves, axis=0)
array([[0., 1.],
       [0., 1.],
       [1., 0.]])

This is equivalent to using tree.predict_proba.

decision_path(x, check_input=True)[source]#

Compute the decision path of the tree.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
sparse matrix of shape (n_samples, n_nodes)

An indicator array where each nonzero value indicates that the sample traverses the corresponding node.

fit(x, y, sample_weight=None, check_input=True)[source]#

Fit a classification tree.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The training time series.

y : array-like of shape (n_samples,)

The target values.

sample_weight : array-like of shape (n_samples,), optional

If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

check_input : bool, optional

Allows bypassing several input checks.

Returns:
self

This instance.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(x, check_input=True)[source]#

Predict the class of the input samples x.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples,)

The predicted classes.

predict_proba(x, check_input=True)[source]#

Predict class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same class in a leaf.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples, n_classes)

The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.

score(X, y, sample_weight=None)[source]#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

Mean accuracy of self.predict(X) w.r.t. y.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

class wildboar.tree.ProximityTreeClassifier(n_pivot=1, *, criterion='entropy', pivot_sample='label', metric_sample='weighted', metric='auto', metric_params=None, metric_factories=None, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_impurity_decrease=0.0, class_weight=None, random_state=None)[source]#

A classifier that uses a k-branching tree based on pivot time series.

Parameters:
n_pivot : int, optional

The number of pivots to sample at each node.

criterion : {"entropy", "gini"}, optional

The impurity criterion.

pivot_sample : {"label", "uniform"}, optional

The pivot sampling method.

metric_sample : {"uniform", "weighted"}, optional

The metric sampling method.

metric : {"auto"}, str or list, optional

The distance metrics. By default, we use the parameterization suggested by Lucas et al. (2019).

  • If "auto", use the default metric specification suggested by Lucas et al. (2019).

  • If str, use a single metric or default metric specification.

  • If list, a custom metric specification can be given as a list of tuples, where the first element of the tuple is a metric name and the second element is a dictionary with a parameter grid specification. A parameter grid specification is a dict with two mandatory and one optional key-value pairs defining the lower and upper bound on the values as well as the number of values in the grid. For example, to specify a grid over the argument 'r' with 10 values in the range 0 to 1, we would give the following specification: dict(min_r=0, max_r=1, num_r=10).

Read more about the metrics and their parameters in the User guide.

metric_params : dict, optional

Parameters for the distance measure. Ignored unless metric is a string.

Read more about the parameters in the User guide.

metric_factories : dict, optional

A metric specification.

Deprecated since version 1.2: Use the combination of metric and metric_params.

max_depth : int, optional

The maximum tree depth.

min_samples_split : int, optional

The minimum number of samples to consider a split.

min_samples_leaf : int, optional

The minimum number of samples in a leaf.

min_impurity_decrease : float, optional

The minimum impurity decrease required to build a sub-tree.

class_weight : dict or "balanced", optional

Weights associated with the labels.

  • If dict, weights of the form {label: weight}.

  • If "balanced", each class weight is inversely proportional to the class frequency.

  • If None, each class has equal weight.

random_state : int or RandomState, optional

  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

References

Lucas, Benjamin, Ahmed Shifaz, Charlotte Pelletier, Lachlan O’Neill, Nayyar Zaidi, Bart Goethals, François Petitjean, and Geoffrey I. Webb. (2019)

Proximity forest: an effective and scalable distance-based classifier for time series. Data Mining and Knowledge Discovery

Examples

Fit a single proximity tree with dynamic time warping and move-split-merge metrics.

>>> from wildboar.datasets import load_dataset
>>> from wildboar.tree import ProximityTreeClassifier
>>> x, y = load_dataset("GunPoint")
>>> f = ProximityTreeClassifier(
...     n_pivot=10,
...     metric=[
...         ("dtw", {"min_r": 0.1, "max_r": 0.25}),
...         ("msm", {"min_c": 0.1, "max_c": 100, "num_c": 20})
...     ],
...     criterion="gini"
... )
>>> f.fit(x, y)
apply(x, check_input=True)[source]#

Return the index of the leaf that each sample is predicted by.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
ndarray of shape (n_samples, )

For every sample, return the index of the leaf that the sample ends up in. The index is in the range [0; node_count].

Examples

Get the leaf probability distribution of a prediction:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier()
>>> tree.fit(X, y)
>>> leaves = tree.apply(X)
>>> tree.tree_.value.take(leaves, axis=0)
array([[0., 1.],
       [0., 1.],
       [1., 0.]])

This is equivalent to using tree.predict_proba.

decision_path(x, check_input=True)[source]#

Compute the decision path of the tree.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
sparse matrix of shape (n_samples, n_nodes)

An indicator array where each nonzero value indicates that the sample traverses the corresponding node.

fit(x, y, sample_weight=None, check_input=True)[source]#

Fit a classification tree.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The training time series.

y : array-like of shape (n_samples,)

The target values.

sample_weight : array-like of shape (n_samples,), optional

If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

check_input : bool, optional

Allows bypassing several input checks.

Returns:
self

This instance.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(x, check_input=True)[source]#

Predict the class of the input samples x.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples,)

The predicted classes.

predict_proba(x, check_input=True)[source]#

Predict class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same class in a leaf.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples, n_classes)

The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.

score(X, y, sample_weight=None)[source]#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

Mean accuracy of self.predict(X) w.r.t. y.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

class wildboar.tree.RocketTreeClassifier(n_kernels=10, *, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_impurity_decrease=0.0, criterion='entropy', sampling='normal', sampling_params=None, kernel_size=None, min_size=None, max_size=None, bias_prob=1.0, normalize_prob=1.0, padding_prob=0.5, class_weight=None, random_state=None)[source]#

A tree classifier that uses random convolutions as features.

Attributes:
tree_ : Tree

The internal tree representation.
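Examples

A minimal usage sketch, assuming the GunPoint dataset as in the other examples (n_kernels=100 is an illustrative value):

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import RocketTreeClassifier
>>> X, y = load_gun_point()
>>> clf = RocketTreeClassifier(n_kernels=100, random_state=1)
>>> clf.fit(X, y)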

apply(x, check_input=True)[source]#

Return the index of the leaf that each sample is predicted by.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
ndarray of shape (n_samples, )

For every sample, return the index of the leaf that the sample ends up in. The index is in the range [0; node_count].

Examples

Get the leaf probability distribution of a prediction:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier()
>>> tree.fit(X, y)
>>> leaves = tree.apply(X)
>>> tree.tree_.value.take(leaves, axis=0)
array([[0., 1.],
       [0., 1.],
       [1., 0.]])

This is equivalent to using tree.predict_proba.

decision_path(x, check_input=True)[source]#

Compute the decision path of the tree.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
sparse matrix of shape (n_samples, n_nodes)

An indicator array where each nonzero value indicates that the sample traverses the corresponding node.

fit(x, y, sample_weight=None, check_input=True)[source]#

Fit a classification tree.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The training time series.

y : array-like of shape (n_samples,)

The target values.

sample_weight : array-like of shape (n_samples,), optional

If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

check_input : bool, optional

Allows bypassing several input checks.

Returns:
self

This instance.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(x, check_input=True)[source]#

Predict the class of the input samples x.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples,)

The predicted classes.

predict_proba(x, check_input=True)[source]#

Predict class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same class in a leaf.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples, n_classes)

The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.

score(X, y, sample_weight=None)[source]#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

Mean accuracy of self.predict(X) w.r.t. y.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

class wildboar.tree.RocketTreeRegressor(n_kernels=10, *, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_impurity_decrease=0.0, criterion='squared_error', sampling='normal', sampling_params=None, kernel_size=None, bias_prob=1.0, normalize_prob=1.0, padding_prob=0.5, random_state=None)[source]#

A tree regressor that uses random convolutions as features.

Attributes:
tree_ : Tree

The internal tree representation.

apply(x, check_input=True)[source]#

Return the index of the leaf that each sample is predicted by.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
ndarray of shape (n_samples, )

For every sample, return the index of the leaf that the sample ends up in. The index is in the range [0; node_count].

Examples

Get the leaf probability distribution of a prediction:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier()
>>> tree.fit(X, y)
>>> leaves = tree.apply(X)
>>> tree.tree_.value.take(leaves, axis=0)
array([[0., 1.],
       [0., 1.],
       [1., 0.]])

This is equivalent to using tree.predict_proba.

decision_path(x, check_input=True)[source]#

Compute the decision path of the tree.

Parameters:
x : array-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_input : bool, optional

Whether to validate the input. Only set to False if you are sure your data is valid.

Returns:
sparse matrix of shape (n_samples, n_nodes)

An indicator array where each nonzero value indicates that the sample traverses the corresponding node.

fit(x, y, sample_weight=None, check_input=True)[source]#

Fit the estimator.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The training time series.

y : array-like of shape (n_samples,)

Target values as floating point values.

sample_weight : array-like of shape (n_samples,), optional

If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

check_input : bool, optional

Allows bypassing several input checks.

Returns:
self

This object.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(x, check_input=True)[source]#

Predict the value of x.

Parameters:
x : array-like of shape (n_samples, n_timesteps)

The input time series.

check_input : bool, optional

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples,)

The predicted values.

score(X, y, sample_weight=None)[source]#

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True values for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

\(R^2\) of self.predict(X) w.r.t. y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score. This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

class wildboar.tree.ShapeletTreeClassifier(*, n_shapelets='log2', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_impurity_decrease=0.0, impurity_equality_tolerance=None, strategy='warn', shapelet_size=0.1, sample_size=1.0, min_shapelet_size=0.0, max_shapelet_size=1.0, coverage_probability=None, variability=1, alpha=None, metric='euclidean', metric_params=None, criterion='entropy', class_weight=None, random_state=None)[source]#

A shapelet tree classifier.

Parameters:
n_shapeletsint or {“log2”, “sqrt”, “auto”}, optional

The number of shapelets in the resulting transform.

  • if, “auto” the number of shapelets depend on the value of strategy. For “best” the number is 1; and for “random” it is 1000.

  • if, “log2”, the number of shaplets is the log2 of the total possible number of shapelets.

  • if, “sqrt”, the number of shaplets is the square root of the total possible number of shapelets.

max_depthint, optional

The maximum depth of the tree. If None the tree is expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_splitint, optional

The minimum number of samples to split an internal node.

min_samples_leafint, optional

The minimum number of samples in a leaf.

min_impurity_decreasefloat, optional

A split will be introduced only if the impurity decrease is larger than or equal to this value.

impurity_equality_tolerancefloat, optional

Tolerance for considering two impurities as equal. If the impurity decrease is the same, we consider the split that maximizes the gap between the sum of distances.

  • If None, we never consider the separation gap.

Added in version 1.3.

strategy{“best”, “random”}, optional

The strategy for selecting shapelets.

  • If “random”, n_shapelets shapelets are randomly selected in the range defined by min_shapelet_size and max_shapelet_size

  • If “best”, n_shapelets shapelets are selected per input sample of the size determined by shapelet_size.

Added in version 1.3: Add support for the “best” strategy. The default will change to “best” in 1.4.

shapelet_sizeint, float or array-like, optional

The shapelet size if strategy=”best”.

  • If int, the exact shapelet size.

  • If float, a fraction of the number of input timesteps.

  • If array-like, a list of float or int.

Added in version 1.3.

sample_sizefloat, optional

The size of the sample used to determine the shapelets, if strategy=”best”.

Added in version 1.3.

min_shapelet_sizefloat, optional

The minimum length of a sampled shapelet expressed as a fraction, computed as max(ceil(X.shape[-1] * min_shapelet_size), 2).

max_shapelet_sizefloat, optional

The maximum length of a sampled shapelet, expressed as a fraction, computed as ceil(X.shape[-1] * max_shapelet_size).

coverage_probabilityfloat, optional

The probability that a time step is covered by a shapelet, in the range 0 < coverage_probability <= 1.

  • For larger coverage_probability, we get larger shapelets.

  • For smaller coverage_probability, we get shorter shapelets.

variabilityfloat, optional

Controls the shape of the Beta distribution used to sample shapelets. Defaults to 1.

  • Higher variability creates more uniform intervals.

  • Lower variability creates more variable interval sizes.

alphafloat, optional

Dynamically adjust the number of sampled shapelets at each node according to the current depth.

  • if \(alpha < 0\), the number of sampled shapelets decreases from n_shapelets towards 1 with increasing depth.

  • if \(alpha > 0\), the number of sampled shapelets increases from 1 towards n_shapelets with increasing depth.

  • if None, the number of sampled shapelets is the same regardless of depth.

metricstr or list, optional
  • If str, the distance metric used to identify the best shapelet.

  • If list, multiple metrics specified as a list of tuples, where the first element of the tuple is a metric name and the second element is a dictionary with a parameter grid specification. A parameter grid specification is a dict with two mandatory and one optional key-value pair, defining the lower bound, the upper bound and the number of values in the grid. For example, to specify a grid over the argument r with 10 values in the range 0 to 1, we would give the following specification: dict(min_r=0, max_r=1, num_r=10). See the example following the references below.

    Read more about metric specifications in the User guide.

Changed in version 1.2: Added support for multi-metric shapelet transform

metric_paramsdict, optional

Parameters for the distance measure. Ignored unless metric is a string.

Read more about the parameters in the User guide.

criterion{“entropy”, “gini”}, optional

The criterion used to evaluate the utility of a split.

class_weightdict or “balanced”, optional

Weights associated with the labels.

  • if dict, weights on the form {label: weight}.

  • if “balanced”, each class weight is inversely proportional to the class frequency.

  • if None, each class has equal weight.

random_stateint or RandomState, optional
  • If int, random_state is the seed used by the random number generator.

  • If RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the RandomState instance used by np.random.

Attributes:
tree_Tree

The tree data structure used internally

classes_ndarray of shape (n_classes,)

The class labels

n_classes_int

The number of class labels

See also

ShapeletTreeRegressor

A shapelet tree regressor.

ExtraShapeletTreeClassifier

An extra random shapelet tree classifier.

Notes

When strategy is set to “best”, the shapelet tree is constructed by selecting the top n_shapelets per sample. The initial construction of the matrix profile for each sample may be computationally intensive for large datasets. To balance accuracy and computational efficiency, the sample_size parameter can be adjusted to determine the number of samples utilized to compute the minimum distance annotation.

The significance of shapelets is determined by the difference between the ab-join of a label with any other label and the self-join of the label, selecting the shapelets with the greatest absolute values. This method is detailed in the work of Zhu et al. (2020).

When strategy is set to “random”, the shapelet tree is constructed by randomly sampling n_shapelets within the range defined by min_shapelet_size and max_shapelet_size. This method is detailed in the work of Karlsson et al. (2016). Alternatively, shapelets can be sampled with a specified coverage_probability and variability. By specifying a coverage probability, we define the probability of including a point in the extracted shapelet. If coverage_probability is set, min_shapelet_size and max_shapelet_size are ignored.

References

Zhu, Y., et al. 2020.

The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code. Data Mining and Knowledge Discovery, 34, pp.949-979.

Karlsson, I., Papapetrou, P. and Boström, H., 2016.

Generalized random shapelet forests. Data mining and knowledge discovery, 30, pp.1053-1085.
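
The following sketch shows how the two strategies can be configured, including a multi-metric specification of the form described above; the concrete parameter values (shapelet size, coverage probability and the r grid) are illustrative assumptions, not recommendations:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> best = ShapeletTreeClassifier(
...     strategy="best", shapelet_size=0.2, random_state=1
... ).fit(X, y)
>>> rnd = ShapeletTreeClassifier(
...     strategy="random",
...     coverage_probability=0.2,
...     metric=[("dtw", dict(min_r=0, max_r=0.25, num_r=5))],
...     random_state=1,
... ).fit(X, y)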

apply(x, check_input=True)[source]#

Return the index of the leaf that each sample is predicted by.

Parameters:
xarray-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_inputbool, optional

Bypass array validation. Only set to False if you are sure your data is valid.

Returns:
ndarray of shape (n_samples, )

For every sample, return the index of the leaf that the sample ends up in. The index is in the range [0; node_count].

Examples

Get the leaf probability distribution of a prediction:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier()
>>> tree.fit(X, y)
>>> leaves = tree.apply(X)
>>> tree.tree_.value.take(leaves, axis=0)
array([[0., 1.],
       [0., 1.],
       [1., 0.]])

This is equivalent to using tree.predict_proba.

decision_path(x, check_input=True)[source]#

Compute the decision path of the tree.

Parameters:
xarray-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_inputbool, optional

Bypass array validation. Only set to False if you are sure your data is valid.

Returns:
sparse matrix of shape (n_samples, n_nodes)

An indicator array where each nonzero values indicate that the sample traverses a node.
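
As a minimal sketch of how the decision path can be inspected (the tree and data mirror the apply example above), summing each row of the indicator matrix gives the number of nodes a sample traverses:

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier(random_state=1).fit(X, y)
>>> paths = tree.decision_path(X[:5])
>>> depths = paths.sum(axis=1)  # nodes traversed per sample, leaf included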

fit(x, y, sample_weight=None, check_input=True)[source]#

Fit a classification tree.

Parameters:
xarray-like of shape (n_samples, n_timesteps)

The training time series.

yarray-like of shape (n_samples,)

The target values.

sample_weightarray-like of shape (n_samples,), optional

If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

check_inputbool, optional

Allow bypassing several input checks.

Returns:
self

This instance.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routingMetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
paramsdict

Parameter names mapped to their values.

predict(x, check_input=True)[source]#

Predict the class of the input samples x.

Parameters:
xarray-like of shape (n_samples, n_timesteps)

The input time series.

check_inputbool, optional

Allow bypassing several input checks. Do not use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples,)

The predicted classes.

predict_proba(x, check_input=True)[source]#

Predict class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same class in a leaf.

Parameters:
xarray-like of shape (n_samples, n_timesteps)

The input time series.

check_inputbool, optional

Allow bypassing several input checks. Do not use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples, n_classes)

The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
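
Continuing with the fitted tree from the apply example above, a short sketch showing that taking the class with the highest probability reproduces predict:

>>> proba = tree.predict_proba(X[:2])
>>> bool((tree.classes_[proba.argmax(axis=1)] == tree.predict(X[:2])).all())
True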

score(X, y, sample_weight=None)[source]#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
Xarray-like of shape (n_samples, n_features)

Test samples.

yarray-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weightarray-like of shape (n_samples,), default=None

Sample weights.

Returns:
scorefloat

Mean accuracy of self.predict(X) w.r.t. y.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**paramsdict

Estimator parameters.

Returns:
selfestimator instance

Estimator instance.

class wildboar.tree.ShapeletTreeRegressor(*, n_shapelets='log2', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_impurity_decrease=0.0, impurity_equality_tolerance=None, strategy='warn', shapelet_size=0.1, sample_size=1.0, min_shapelet_size=0, max_shapelet_size=1, coverage_probability=None, variability=1, alpha=None, metric='euclidean', metric_params=None, criterion='squared_error', random_state=None)[source]#

A shapelet tree regressor.

Parameters:
n_shapeletsint or {“log2”, “sqrt”, “auto”}, optional

The number of shapelets to sample at each node.

max_depthint, optional

The maximum depth of the tree. If None the tree is expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_splitint, optional

The minimum number of samples to split an internal node.

min_samples_leafint, optional

The minimum number of samples in a leaf.

min_impurity_decreasefloat, optional

A split will be introduced only if the impurity decrease is larger than or equal to this value.

impurity_equality_tolerancefloat, optional

Tolerance for considering two impurities as equal. If two candidate splits have impurity decreases that are equal within this tolerance, we prefer the split that maximizes the separation gap between the sums of distances.

  • If None, we never consider the separation gap.

Added in version 1.3.

strategy{“best”, “random”}, optional

The strategy for selecting shapelets.

  • If “random”, n_shapelets shapelets are randomly selected in the range defined by min_shapelet_size and max_shapelet_size

  • If “best”, n_shapelets shapelets are selected per input sample of the size determined by shapelet_size.

Added in version 1.3: Add support for the “best” strategy. The default will change to “best” in 1.4.

shapelet_sizeint, float or array-like, optional

The shapelet size if strategy=”best”.

  • If int, the exact shapelet size.

  • If float, a fraction of the number of input timesteps.

  • If array-like, a list of float or int.

Added in version 1.3.

sample_sizefloat, optional

The size of the sample used to determine the shapelets, if strategy=”best”.

Added in version 1.3.

min_shapelet_sizefloat, optional

The minimum length of a sampled shapelet, expressed as a fraction of n_timestep.

max_shapelet_sizefloat, optional

The maximum length of a sampled shapelet, expressed as a fraction of n_timestep.

coverage_probabilityfloat, optional

The probability that a time step is covered by a shapelet, in the range 0 < coverage_probability <= 1.

  • For larger coverage_probability, we get larger shapelets.

  • For smaller coverage_probability, we get shorter shapelets.

variabilityfloat, optional

Controls the shape of the Beta distribution used to sample shapelets. Defaults to 1.

  • Higher variability creates more uniform intervals.

  • Lower variability creates more variable interval sizes.

alphafloat, optional

Dynamically adjust the number of sampled shapelets at each node according to the current depth (see the example following the references below), using the weight:

w = 1 - exp(-abs(alpha) * depth)

  • if alpha < 0, the number of sampled shapelets decreases from n_shapelets towards 1 with increasing depth.

  • if alpha > 0, the number of sampled shapelets increases from 1 towards n_shapelets with increasing depth.

  • if None, the number of sampled shapelets is the same regardless of depth.

metricstr or list, optional
  • If str, the distance metric used to identify the best shapelet.

  • If list, multiple metrics specified as a list of tuples, where the first element of the tuple is a metric name and the second element is a dictionary with a parameter grid specification. A parameter grid specification is a dict with two mandatory and one optional key-value pair, defining the lower bound, the upper bound and the number of values in the grid. For example, to specify a grid over the argument r with 10 values in the range 0 to 1, we would give the following specification: dict(min_r=0, max_r=1, num_r=10).

    Read more about metric specifications in the User guide.

Changed in version 1.2: Added support for multi-metric shapelet transform

metric_paramsdict, optional

Parameters for the distance measure. Ignored unless metric is a string.

Read more about the parameters in the User guide.

criterion{“squared_error”}, optional

The criterion used to evaluate the utility of a split.

Deprecated since version 1.1: Criterion “mse” was deprecated in v1.1 and removed in version 1.2.

random_stateint or RandomState, optional
  • If int, random_state is the seed used by the random number generator.

  • If numpy.random.RandomState instance, random_state is the random number generator.

  • If None, the random number generator is the numpy.random.RandomState instance used by numpy.random.

Attributes:
tree_Tree

The internal tree representation

Notes

When strategy is set to “best”, the shapelet tree is constructed by selecting the top n_shapelets per sample. The initial construction of the matrix profile for each sample may be computationally intensive for large datasets. To balance accuracy and computational efficiency, the sample_size parameter can be adjusted to determine the number of samples utilized to compute the minimum distance annotation.

The significance of shapelets is determined by the difference between the ab-join of a label with any other label and the self-join of the label, selecting the shapelets with the greatest absolute values. This method is detailed in the work of Zhu et al. (2020).

When strategy is set to “random”, the shapelet tree is constructed by randomly sampling n_shapelets within the range defined by min_shapelet_size and max_shapelet_size. This method is detailed in the work of Karlsson et al. (2016). Alternatively, shapelets can be sampled with a specified coverage_probability and variability. By specifying a coverage probability, we define the probability of including a point in the extracted shapelet. If coverage_probability is set, min_shapelet_size and max_shapelet_size are ignored.

References

Zhu, Y., et al. 2020.

The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code. Data Mining and Knowledge Discovery, 34, pp.949-979.

Karlsson, I., Papapetrou, P. and Boström, H., 2016.

Generalized random shapelet forests. Data mining and knowledge discovery, 30, pp.1053-1085.
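
A minimal, self-contained sketch of fitting the regressor; the synthetic data and all parameter values are made up for illustration, and alpha=-0.5 shrinks the number of sampled shapelets with depth as described above:

>>> import numpy as np
>>> from wildboar.tree import ShapeletTreeRegressor
>>> rng = np.random.RandomState(1)
>>> X = rng.randn(20, 50)        # 20 synthetic series with 50 timesteps
>>> y = X[:, :10].mean(axis=1)   # target derived from the early timesteps
>>> reg = ShapeletTreeRegressor(
...     n_shapelets=10, strategy="random", alpha=-0.5, random_state=1
... ).fit(X, y)
>>> reg.predict(X[:3]).shape
(3,)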

apply(x, check_input=True)[source]#

Return the index of the leaf that each sample is predicted by.

Parameters:
xarray-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_inputbool, optional

Bypass array validation. Only set to False if you are sure your data is valid.

Returns:
ndarray of shape (n_samples, )

For every sample, return the index of the leaf that the sample ends up in. The index is in the range [0; node_count].

Examples

Get the leaf probability distribution of a prediction (shown here with the classifier; for a regressor, tree_.value holds the leaf means):

>>> from wildboar.datasets import load_gun_point
>>> from wildboar.tree import ShapeletTreeClassifier
>>> X, y = load_gun_point()
>>> tree = ShapeletTreeClassifier()
>>> tree.fit(X, y)
>>> leaves = tree.apply(X)
>>> tree.tree_.value.take(leaves, axis=0)
array([[0., 1.],
       [0., 1.],
       [1., 0.]])

For a classifier, this is equivalent to using tree.predict_proba.

decision_path(x, check_input=True)[source]#

Compute the decision path of the tree.

Parameters:
xarray-like of shape (n_samples, n_timestep) or (n_samples, n_dims, n_timestep)

The input samples.

check_inputbool, optional

Bypass array validation. Only set to False if you are sure your data is valid.

Returns:
sparse matrix of shape (n_samples, n_nodes)

An indicator array where each nonzero values indicate that the sample traverses a node.

fit(x, y, sample_weight=None, check_input=True)[source]#

Fit the estimator.

Parameters:
xarray-like of shape (n_samples, n_timesteps)

The training time series.

yarray-like of shape (n_samples,)

Target values as floating point values.

sample_weightarray-like of shape (n_samples,), optional

If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

check_inputbool, optional

Allow bypassing several input checks.

Returns:
self

This object.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routingMetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
paramsdict

Parameter names mapped to their values.

predict(x, check_input=True)[source]#

Predict the value of x.

Parameters:
xarray-like of shape (n_samples, n_timesteps)

The input time series.

check_inputbool, optional

Allow bypassing several input checks. Do not use this parameter unless you know what you are doing.

Returns:
ndarray of shape (n_samples,)

The predicted values.

score(X, y, sample_weight=None)[source]#

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters:
Xarray-like of shape (n_samples, n_features)

Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

yarray-like of shape (n_samples,) or (n_samples, n_outputs)

True values for X.

sample_weightarray-like of shape (n_samples,), default=None

Sample weights.

Returns:
scorefloat

\(R^2\) of self.predict(X) w.r.t. y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score. This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**paramsdict

Estimator parameters.

Returns:
selfestimator instance

Estimator instance.

wildboar.tree.plot_tree(clf, *, ax=None, bbox_args=dict(), arrow_args=dict(arrowstyle='<-'), max_depth=None, class_labels=True, fontsize=None, node_labeler=None)[source]#

Plot a tree

Parameters:
clftree-based estimator

A decision tree.

axaxes, optional

The axes to plot the tree to.

bbox_argsdict, optional

Arguments to the node box.

arrow_argsdict, optional

Arguments to the arrow.

max_depthint, optional

Only show the branches until max_depth.

class_labelsbool or array-like, optional

Show the classes

  • if True, show classes from the classes_ attribute of the decision tree.

  • if False, show leaf probabilities.

  • if array-like, show classes from the array.

fontsizeint, optional

The font size. If None, the font size is determined automatically.

node_labelercallable, optional

A function returning the label for a node, of the form f(node) -> str.

  • If node.children is None, the node is a leaf.

  • node._attr contains information about the node:

    • n_node_samples: the number of samples reaching the node

    • if leaf, value is an array with the fraction of each label among the samples reaching the leaf (for classification), or the mean of the target among the samples reaching the leaf (for regression). Determine whether it is a classification or regression tree by inspecting the shape of the value array.

    • if branch, threshold contains the threshold used to split the node.

    • if branch, dim contains the dimension from which the attribute was extracted.

    • if branch, attribute contains the attribute used for computing the feature value. The attribute depends on the estimator.

Returns:
axes

The axes.

Examples

>>> from wildboar.datasets import load_two_lead_ecg
>>> from wildboar.tree import ShapeletTreeClassifier, plot_tree
>>> X, y = load_two_lead_ecg()
>>> clf = ShapeletTreeClassifier(strategy="random").fit(X, y)
>>> plot_tree(clf)
<Axes: >
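
As a sketch of the node_labeler hook, the hypothetical labeler below assumes that node._attr supports dict-style access to the keys listed above and that leaves have node.children set to None; verify the exact interface against the Tree implementation:

>>> def labeler(node):
...     if node.children is None:  # leaf: show the number of samples
...         return "n=%d" % node._attr["n_node_samples"]
...     return "t=%.2f" % node._attr["threshold"]  # branch: show the threshold
>>> plot_tree(clf, node_labeler=labeler)
<Axes: >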