Transform-based estimators#
Time series transformations are designed to convert time series data into
traditional column-based feature matrices suitable for input into subsequent
classifiers or regressors. Notable feature representations are Rocket and
Hydra, which employ convolutional kernels; shapelet-based transformations, which
utilize shapelet distances; and interval-based transformations, which calculate
feature values for overlapping or non-overlapping intervals. Typically, these
transformations are used with a linear estimator such as RidgeClassifierCV.
Throughout this section, we will use the TwoLeadECG dataset.
from wildboar.datasets import load_two_lead_ecg
X_train, X_test, y_train, y_test = load_two_lead_ecg(merge_train_test=False)
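To illustrate the general pattern before turning to the individual
transforms, the following is a minimal sketch that pairs a transformation
with RidgeClassifierCV in a scikit-learn pipeline. It assumes
wildboar.transform.RocketTransform; any of the transforms described below
can be substituted.
# A minimal sketch of the transform + linear estimator pattern. We assume
# wildboar.transform.RocketTransform here; the shapelet-based transforms
# described below follow the same convention.
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from wildboar.transform import RocketTransform

# Transform each time series into a feature vector, then fit a ridge
# classifier with built-in cross-validated regularization.
pipe = make_pipeline(RocketTransform(random_state=1), RidgeClassifierCV())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)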
Shapelet-based transform#
Shapelet-based transformation uses shapelets, i.e., discriminatory
subsequences, together with a distance metric to construct a feature
representation.
Random shapelet transform#
The simplest, and often effective, approach is to sample a large number of
shapelets and include all of them, without filtering, in the transformation.
This approach is implemented in RandomShapeletClassifier.
from wildboar.linear_model import RandomShapeletClassifier
clf = RandomShapeletClassifier(random_state=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.990342405618964
We can change the metric
(by default the metric is set to the Euclidean
distance):
clf = RandomShapeletClassifier(metric="scaled_manhattan", random_state=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.990342405618964
We can also specify multiple metrics; see the metric specification guide for
more information on how to format them. We can also limit the size of the
sampled shapelets. Both options are sketched below.
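The following sketch is hedged: the list-of-tuples metric format and the
min_shapelet_size/max_shapelet_size parameters (expressed as fractions of
the series length) are assumptions here; verify the exact syntax against
the metric specification guide and the RandomShapeletClassifier reference.
# A hedged sketch: the list-of-tuples metric format and the size
# parameters below are assumptions; consult the metric specification
# guide for the exact syntax.
clf = RandomShapeletClassifier(
    metric=[
        ("euclidean", None),
        ("scaled_manhattan", None),
    ],
    min_shapelet_size=0.1,  # shapelets no shorter than 10% of n_timesteps
    max_shapelet_size=0.5,  # and no longer than 50%
    random_state=1,
)
clf.fit(X_train, y_train)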
In these examples, the n input time series are transformed into a new
representation consisting of n rows and n_shapelets columns, where the i-th
time series is characterized by its minimum distance to each of the
n_shapelets shapelets. By default, each feature is normalized to have a mean
of zero and a standard deviation of one. However, this normalization can be
disabled by setting the parameter normalize=False.
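For example, to keep the raw minimum distances as features:
# Keep the raw minimum distances instead of standardized features
# (normalize defaults to True, as described above).
clf = RandomShapeletClassifier(normalize=False, random_state=1)
clf.fit(X_train, y_train)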
Dilated shapelet transform#
A more recent approach, described by Guillaume et al. 2021 [4], constructs
a feature representation that incorporates not only the minimal distance but
also the occurrence counts of shapelets and the index of the minimal distance.
Consequently, each shapelet is characterized by a triad of features rather than
a singular feature. Furthermore, the Dilated Shapelet Transform (DST) expands
shapelets by inserting empty values, thereby increasing the “receptive field”
of the shapelets. This method is implemented within Wildboar as
DilatedShapeletClassifier:
from wildboar.linear_model import DilatedShapeletClassifier
clf = DilatedShapeletClassifier(random_state=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.9991220368744512
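To build intuition for dilation, the following standalone sketch (purely
illustrative, not part of the Wildboar API) shows how a dilation of d
compares a shapelet against time steps spaced d apart, widening its
receptive field:
import numpy as np

# Purely illustrative: with dilation d, a shapelet of length 3 is compared
# against time steps spaced d apart instead of consecutive time steps.
x = np.arange(10)
dilation = 3
indices = np.arange(3) * dilation  # compare at time steps 0, 3 and 6
window = x[indices]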
The DST classifier supports only a single metric; however, multiple
parameters are available for tuning. Specifically, the size of the shapelets
can be adjusted: by default, shapelets of length 7, 9, and 11 are utilized.
This can be modified using the parameters shapelet_size, min_shapelet_size,
and max_shapelet_size. For instance:
clf = DilatedShapeletClassifier(shapelet_size=[7, 11], random_state=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.9991220368744512
If the parameters min_shapelet_size or max_shapelet_size are specified, all
odd sizes ranging from n_timesteps * min_shapelet_size to
n_timesteps * max_shapelet_size will be utilized, as sketched below.
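For example, assuming the bounds are given as fractions of n_timesteps:
# Assuming min_shapelet_size and max_shapelet_size are fractions of
# n_timesteps, all odd sizes between n_timesteps * 0.1 and
# n_timesteps * 0.3 are used.
clf = DilatedShapeletClassifier(
    min_shapelet_size=0.1, max_shapelet_size=0.3, random_state=1
)
clf.fit(X_train, y_train)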
The likelihood of z-normalizing the shapelets can be adjusted by modifying
the normalize_prob parameter. It defaults to 0.8, indicating that 80 percent
of the shapelets undergo normalization.
clf = DilatedShapeletClassifier(normalize_prob=0.1, random_state=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.9920983318700615
We can also control the occurrence threshold, that is, the threshold used
when counting shapelet occurrences, by modifying the lower and upper
parameters. These parameters delineate the bounds within which the
occurrence threshold is sampled. By default, it is sampled from among the 5
to 10 percent smallest distances.
clf = DilatedShapeletClassifier(lower=0.1, upper=0.3, random_state=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.9991220368744512
Castor transform#
Castor (Competing dilAted Shapelet TransfORm) [5] is a transformation
technique for time series data. Analogous to Hydra, Castor enables shapelets
to compete, and akin to DST, it utilizes the occurrence of shapelets. Castor
is characterized by two principal parameters: the number of groups
(n_groups) and the number of shapelets per group (n_shapelets). These
parameters collectively define the dimensions of the transformed feature
space. By convention, we employ 64 groups, each comprising 8 shapelets.
from wildboar.linear_model import CastorClassifier
clf = CastorClassifier(random_state=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
1.0
Castor has several tunable parameters, with n_groups and n_shapelets being
the most influential in determining classification accuracy. Generally,
increasing the values of n_groups and n_shapelets enhances accuracy, as
these parameters adjust the level of competition among features. For
instance, a configuration with n_groups=1 and n_shapelets=1024 results in
maximal competition, leading to a feature representation that closely
resembles a pattern dictionary. Conversely, setting n_groups=1024 and
n_shapelets=1 eliminates competition, yielding a transformation akin to a
traditional shapelet-based transform, such as the Dilated Shapelet Transform
(DST). Both extremes are sketched below.
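For illustration, the two extremes can be expressed directly (these settings
demonstrate the trade-off and are not recommendations):
# Maximal competition: one group of 1024 competing shapelets, resembling
# a pattern dictionary.
dictionary_like = CastorClassifier(n_groups=1, n_shapelets=1024, random_state=1)

# No competition: 1024 groups of a single shapelet each, resembling a
# traditional shapelet transform such as DST.
dst_like = CastorClassifier(n_groups=1024, n_shapelets=1, random_state=1)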
A recommended approach for parameter tuning is to incrementally double the
values of both n_groups and n_shapelets in successive iterations, for
example with a grid search as sketched below.
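A hedged sketch of this strategy using scikit-learn's GridSearchCV; the
grids below are illustrative starting points, not recommendations:
from sklearn.model_selection import GridSearchCV

# Double n_groups and n_shapelets in successive steps and let
# cross-validation pick the best combination.
param_grid = {
    "n_groups": [64, 128, 256],
    "n_shapelets": [8, 16, 32],
}
search = GridSearchCV(CastorClassifier(random_state=1), param_grid)
search.fit(X_train, y_train)
print(search.best_params_)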
Warning
For better performance on multivariate datasets, multiply n_shapelets by the
number of dimensions (i.e., set it to n_shapelets * n_dims) to ensure
sufficient feature variability.
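For instance, for a hypothetical 3-dimensional dataset (TwoLeadECG itself is
univariate, so this is illustrative only), the warning suggests:
# Hypothetical multivariate adjustment: scale the per-group shapelet count
# by the number of dimensions (here n_dims = 3), per the warning above.
n_dims = 3
clf = CastorClassifier(n_shapelets=8 * n_dims, random_state=1)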
from wildboar.linear_model import CastorClassifier
clf = CastorClassifier(n_groups=128, n_shapelets=16, random_state=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
1.0