User Guide#

In general, a machine learning problem considers a collection of n data samples with the objective of predicting properties of unseen data. Wildboar focuses on machine learning problems in which the data samples are series, such as time series or other data that is ordered chronologically or logically.

Note

For solving general machine learning problems with Python, consider using scikit-learn.

Similar to general machine learning, temporal machine learning is concerned with problems that fall into different categories:

  • Supervised learning, in which the data series are labeled with additional information, either numerical or nominal.

    • In classification, each time series belongs to one of two or more classes, and the goal is to learn a function that can label unlabeled time series.

    • In regression, each time series is labeled with a numerical attribute, and the task is to assign a numerical value to an unlabeled time series.

Loading an example dataset#

To start exploring Wildboar and temporal machine learning, we first need a dataset. Wildboar conveniently includes several standard datasets from the time series community, sourced from the UCR Time Series repository.

In the following example, we load the synthetic_control and TwoLeadECG datasets.

from wildboar.datasets import load_synthetic_control, load_two_lead_ecg
x, y = load_synthetic_control()
x_train, x_test, y_train, y_test = load_two_lead_ecg(merge_train_test=False)

Preserving the original training and testing splits from the UCR repository can be achieved by disabling the merge_train_test option.

A more robust and reliable method for splitting the datasets into training and testing partitions is to use the model selection functions provided by scikit-learn.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

The datasets are NumPy ndarrays with x.ndim == 2 and y.ndim == 1. We can get the number of samples and time points.

n_samples, n_timestep = x.shape

Note

Wildboar also supports multivariate time series using 3d arrays, i.e., we can get the shape of the dataset with n_samples, n_dims, n_timestep = x.shape. Since operations are often performed over the temporal dimension, we place it last in the array; with the C-order arrays we prefer, the values of each time series are then contiguous in memory. A robust approach for getting the number of samples and the number of time steps, irrespective of univariate (2d) or multivariate (3d) time series, is:

n_samples, n_timestep = x.shape[0], x.shape[-1]
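As a quick illustration (a self-contained sketch using zero-filled arrays rather than a Wildboar dataset), the same expression yields the correct values for both layouts:

import numpy as np

x2 = np.zeros((10, 100))     # univariate: (n_samples, n_timestep)
x3 = np.zeros((10, 3, 100))  # multivariate: (n_samples, n_dims, n_timestep)
for x in (x2, x3):
    n_samples, n_timestep = x.shape[0], x.shape[-1]
    print(n_samples, n_timestep)  # prints 10 100 for both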

In the example, we use load_two_lead_ecg and load_synthetic_control to load the datasets. A more general approach is to use the load_dataset function from the same module.

from wildboar.datasets import load_dataset
x, y = load_dataset("synthetic_control")

load_dataset accepts multiple parameters for specifying where to load data from and how to preprocess it. By default, Wildboar loads datasets from the wildboar/ucr repository, which includes datasets from the UCR Time Series repository. The user can specify a different repository using the repository argument. For example, we can load the regression task FloodModeling1 from the UEA & UCR Time Series Extrinsic Regression Repository, standardizing each time series to zero mean and unit variance, using the following snippet:

x, y = load_dataset(
   "FloodModeling1", repository="wildboar/tsereg", preprocess="standardize"
)
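To discover what is available, wildboar.datasets also provides listing helpers. The names list_repositories and list_datasets below reflect the API at the time of writing; consult the API documentation if they have changed:

from wildboar.datasets import list_datasets, list_repositories

print(list_repositories())                        # available repositories
print(sorted(list_datasets("wildboar/ucr"))[:5])  # a few dataset names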

Learning and predicting#

Estimators in Wildboar implement the same interface as estimators in scikit-learn. We can fit an estimator to an input dataset and predict the labels of new samples.

An example of a temporal estimator is ensemble.ShapeletForestClassifier, which implements the random shapelet forest classifier.

from wildboar.ensemble import ShapeletForestClassifier
clf = ShapeletForestClassifier()
clf.fit(x_train, y_train)

We fit the classifier (clf) using the training samples, and use the same object to predict the label of a previously unseen sample.

clf.predict(x_test[-1:, :])  # outputs array([6.])

Note

The predict function expects an ndarray of shape (n_samples, n_timestep), where n_timestep equals the number of time steps in the training data.
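Since the estimators follow the scikit-learn interface, we can also inspect class probabilities (a minimal sketch; predict_proba and classes_ are part of the standard scikit-learn classifier API):

proba = clf.predict_proba(x_test[-1:, :])
# one column per class, ordered as in clf.classes_
print(dict(zip(clf.classes_, proba[0])))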

Wildboar also simplifies experimentation over multiple datasets by allowing the user to repeatedly load several datasets from a repository.

from sklearn import clone
from wildboar.datasets import load_datasets

for name, (x_train, x_test, y_train, y_test) in load_datasets(
    "wildboar/ucr",
    collection="bake-off",
    merge_train_test=False,
    filter="n_samples<=300",
):
    clf = clone(clf)  # get an unfitted copy with the same parameters
    clf.fit(x_train, y_train)
    print(f"{name}: {clf.score(x_test, y_test)}")

In the example, we load all datasets in the bake-off collection from the wildboar/ucr repository, keeping only datasets with at most 300 samples. For each dataset, we clone the estimator (obtaining an unfitted copy with the same parameters, including any random seed) and fit it to the training split. Then we print the dataset name and the predictive performance on the test split. You can read more about datasets in the API documentation.

Transforming time series to tabular data#

Despite the numerous estimators specialized for temporal data, an even larger collection of methods exists for tabular data (e.g., as implemented by scikit-learn). For this purpose, Wildboar implements several transformers that convert temporal data to tabular data. Wildboar transformers follow the same convention as scikit-learn and implement a fit method that learns the representation and a transform method that outputs the new tabular representation of each input sample.

from wildboar.transform import RocketTransform
rocket = RocketTransform()
rocket.fit(x)
tabular_x = rocket.transform(x)
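The result is an ordinary 2d array of shape (n_samples, n_features), so any tabular estimator can consume it. For example (a sketch pairing the transform with scikit-learn's RandomForestClassifier):

from sklearn.ensemble import RandomForestClassifier

print(tabular_x.shape)  # (n_samples, n_features)
tab_clf = RandomForestClassifier().fit(tabular_x, y)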

One of Wildboar's main design goals is seamless interoperability with scikit-learn. As such, we can use Wildboar transformers to build scikit-learn pipelines.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

clf = make_pipeline(
   RocketTransform(),
   LogisticRegression(),
)
clf.fit(x, y)
clf.score(x, y)

Warning

In the above example, we train and evaluate the model on the same data. This is bad practice. Instead, we should use a proper hold-out set when estimating the pipeline's performance.
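A minimal sketch of the recommended practice, reusing train_test_split from earlier:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
clf.fit(x_train, y_train)
print(clf.score(x_test, y_test))  # estimate performance on unseen data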

Exploring model performance#

Wildboar implements several methods for explaining classifiers, e.g., using counterfactual reasoning or input dependencies.

from wildboar.explain import IntervalImportance
i = IntervalImportance()
i.fit(clf, x, y)
i.plot(x, y=y)

The wildboar.explain.IntervalImportance class identifies temporal regions that are responsible for the classifier's performance. It does so by breaking the dependency between contiguous intervals and the label: the values within an interval are shuffled between samples, and the predictive performance of the classifier is re-evaluated on the shuffled data. In the example, we evaluate the in-sample importance, which captures the reliance of the model on a particular interval.

[Figure: interval importance plot]

The explain.IntervalImportance.plot method can be used to visualize the interval importance, or we can retrieve the full importance matrix.

>>> i.importance_.mean()
[..., 0.31, 0.30, 0.34, ...]

Model persistence#

All Wildboar models can be persisted to disk using pickle:

import pickle
serialized = pickle.dumps(clf)  # clf fitted earlier
clf_ = pickle.loads(serialized)
clf_.predict(x_test[-1:, :])  # outputs array([6.])

Models persisted using an older version of Wildboar are not guaranteed to work with a newer version (or vice versa).
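One defensive pattern (a sketch, not a Wildboar feature) is to persist the library version together with the model, so that a mismatch can be detected when the model is loaded. This assumes wildboar exposes __version__, as most packages do:

import pickle
import wildboar

# store the library version next to the model
with open("model.pkl", "wb") as f:
    pickle.dump({"wildboar_version": wildboar.__version__, "model": clf}, f)

with open("model.pkl", "rb") as f:
    payload = pickle.load(f)
if payload["wildboar_version"] != wildboar.__version__:
    print("Warning: model persisted with a different Wildboar version")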

Warning

The pickle module is not secure. Only unpickle data you trust. Read more in the Python documentation.