Dimension selection
In multivariate settings, it is often useful to reduce the number of
dimensions of the time series. Wildboar supports dimension selection
in the wildboar.dimension_selection module and implements a few
strategies inspired by traditional feature selection.
Dimension variance threshold
The simplest approach computes, for each dimension, the variance of the pairwise distances between time series, and filters out dimensions in which the distances have low or no variance.
from wildboar.datasets import load_ering
from wildboar.dimension_selection import DistanceVarianceThreshold
X, y = load_ering()
t = DistanceVarianceThreshold(threshold=9)
t.fit(X, y)
DistanceVarianceThreshold(threshold=9)
We set the variance threshold to 9 to filter out any dimensions whose pairwise distance variance is no greater than 9.
t.get_dimensions()
array([ True, False, True, True])
The filter removes only the second dimension.
t.transform(X).shape
(300, 3, 65)
The resulting transformation contains only the three remaining dimensions.
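To make the selection criterion concrete, here is a minimal NumPy sketch (not wildboar's implementation) of the quantity the threshold is compared against: for each dimension, the variance of the pairwise Euclidean distances between the time series. The `distance_variance` helper, the synthetic data, and the threshold of 0.1 are all illustrative assumptions.

```python
import numpy as np

def distance_variance(X):
    """Variance of the pairwise Euclidean distances within each dimension.

    X is assumed to have shape (n_samples, n_dims, n_timestep);
    returns an array of shape (n_dims,).
    """
    n_samples, n_dims, _ = X.shape
    i, j = np.triu_indices(n_samples, k=1)  # every unordered pair of samples
    variances = np.empty(n_dims)
    for d in range(n_dims):
        # Euclidean distance between each pair of series in dimension d.
        dists = np.linalg.norm(X[i, d] - X[j, d], axis=1)
        variances[d] = dists.var()
    return variances

# Synthetic example: the second dimension is almost constant across series,
# so its pairwise distances barely vary and the dimension is filtered out.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3, 50))
X[:, 1] *= 0.01
variances = distance_variance(X)
mask = variances > 0.1  # keep dimensions above an illustrative threshold
```

A dimension whose series are all close to each other produces nearly identical pairwise distances, hence a variance below the threshold, which is why it carries little discriminative information.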
Sequential dimension selector
Sequentially select a set of dimensions by adding (forward selection) or removing (backward selection) dimensions to greedily form a subset. At each iteration, the algorithm chooses the best dimension to add or remove based on the cross-validation score of a classifier or regressor.
from wildboar.datasets import load_ering
from wildboar.dimension_selection import SequentialDimensionSelector
from wildboar.distance import KNeighborsClassifier
X, y = load_ering()
t = SequentialDimensionSelector(KNeighborsClassifier(), n_dims=2)
t.fit(X, y)
SequentialDimensionSelector(estimator=KNeighborsClassifier(), n_dims=2)
We select the two dimensions with the best predictive performance.
t.get_dimensions()
array([ True, False, False, True])
The resulting transformation contains only those dimensions.
t.transform(X).shape
(300, 2, 65)
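The greedy forward strategy described above can be sketched in a few lines. The following is an illustrative implementation (not wildboar's) that uses scikit-learn's KNeighborsClassifier on the flattened series; the `forward_select` helper and the synthetic data are assumptions for the sake of the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select(X, y, n_dims, estimator=None):
    """Greedily add the dimension that maximizes the cross-validation score."""
    if estimator is None:
        estimator = KNeighborsClassifier()
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_dims:
        scores = []
        for d in remaining:
            # Flatten the candidate subset of dimensions into one feature
            # vector per sample and score it with cross-validation.
            Xd = X[:, selected + [d], :].reshape(len(X), -1)
            scores.append(cross_val_score(estimator, Xd, y, cv=3).mean())
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)

# Synthetic data where only the first dimension carries the class signal.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=60)
X = rng.normal(size=(60, 3, 30))
X[:, 0, :] += 3.0 * y[:, None]
selected = forward_select(X, y, n_dims=1)
```

Because the score is re-evaluated for every candidate at every iteration, the cost grows with the number of dimensions and the cost of fitting the estimator, which is why a cheap estimator such as a nearest-neighbor classifier is a common choice.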
To evaluate the effect of dimension selection on a downstream classifier, we first fit a Rocket classifier using all dimensions.
from wildboar.linear_model import RocketClassifier
X_train, X_test, y_train, y_test = load_ering(merge_train_test=False)
clf = RocketClassifier(random_state=2)
clf.fit(X_train, y_train)
RocketClassifier(random_state=2)
Using all dimensions, the Rocket classifier has an accuracy of 0.92.
Using the make_pipeline function from scikit-learn, we can reduce the
number of dimensions before fitting the classifier.
from sklearn.pipeline import make_pipeline
clf = make_pipeline(
    SequentialDimensionSelector(KNeighborsClassifier(), n_dims=3),
    RocketClassifier(random_state=2),
)
clf.fit(X_train, y_train)
Pipeline(steps=[('sequentialdimensionselector',
                 SequentialDimensionSelector(estimator=KNeighborsClassifier(),
                                             n_dims=3)),
                ('rocketclassifier', RocketClassifier(random_state=2))])
Using only the selected dimensions, the Rocket classifier instead has an accuracy of 0.98.