Distance#
Pairwise distance#
A typical workflow involves computing the distance between pairs of time series.
In Wildboar, we accomplish this using the pairwise_distance function,
which computes the distance between each time series in X and, optionally, each time series in Y.
For example, we can compute the Euclidean distance (the default) between each
pair of time series in a single array as:
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import pairwise_distance
X, y = load_two_lead_ecg()
pairwise_distance(X[:3])
array([[0. , 3.51158857, 5.11514381],
[3.51158857, 0. , 2.35905618],
[5.11514381, 2.35905618, 0. ]])
Note
When computing the distance between time series in a single array, the upper and lower triangles of the distance matrix are mirror images of each other. Wildboar exploits this and computes only the upper triangle, mirroring it into the lower half, which halves the computational cost. As such, it is advised to compute the distance as pairwise_distance(X) and not pairwise_distance(X, X). Wildboar also detects when X is Y, i.e., when X and Y are the same object, and computes only a single triangle. However, it does not detect when (X == Y).all() (e.g., pairwise_distance(X, X.copy())) and will then compute the full matrix.
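To illustrate the note, the following minimal sketch (an addition, not part of the original example) verifies that the single-array call and the two-array call with a copy produce the same symmetric matrix, even though only the former exploits the symmetry:
import numpy as np
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import pairwise_distance
X, _ = load_two_lead_ecg()
D_single = pairwise_distance(X[:50])  # computes one triangle and mirrors it
D_full = pairwise_distance(X[:50], X[:50].copy())  # computes the full matrix
assert np.allclose(D_single, D_full)
assert np.allclose(D_single, D_single.T)  # the result is symmetric either way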
If we pass a second array, then the returned matrix is the pairwise distance between the time series from both X and Y.
pairwise_distance(X[:3], X[3:6])
array([[4.85497117, 5.96086309, 6.18777928],
[2.00606825, 5.23060212, 4.27419835],
[1.64445581, 6.38965963, 4.79102936]])
By default, the pairwise_distance
function computes the Euclidean
distance. We can change the metric by specifying the metric parameter
as one of the supported metrics:
pairwise_distance(X[:3], X[3:6], metric="edr")
array([[0.59756098, 0.47560976, 0.64634146],
[0.08536585, 0.03658537, 0.13414634],
[0.09756098, 0.25609756, 0.12195122]])
We can also pass optional extra parameters to the metric using the metric_params parameter:
pairwise_distance(X[:3], X[3:6], metric="twe", metric_params={"stiffness": 0.3})
array([[76.20881199, 73.62554784, 88.5536877 ],
[27.49142159, 60.56024904, 50.24551102],
[20.45513015, 81.60658533, 54.06099416]])
We support a multitude of input configurations, and the return value of
pairwise_distance depends on the shapes of X (with x_samples) and Y (with
y_samples) and on the dim parameter (a shape-checking sketch follows the list):
- A single 1d-array `X`
Returns the scalar \(0\).
- A single 2d-array `X`
Returns a 2d-array of shape (x_samples, x_samples) with the distance between each pair of samples in X.
pairwise_distance(X[0:2])
array([[0. , 3.51158857],
       [3.51158857, 0. ]])
- A 1d-array `X` and a 1d-array `Y`
Returns a scalar with the distance of X to Y
pairwise_distance(X[0], X[2])
5.115143810620726
- A 1d-array `X` and a 2d-array `Y`
Returns a 1d-array of shape (y_samples, ) with the distance between X and each sample in Y. If Y is a 1d-array and X is a 2d-array, a 1d-array of shape (x_samples, ) with the distance between Y and each sample in X is returned.
pairwise_distance(X[0], X[2:5])
array([5.11514381, 4.85497117, 5.96086309])
- A 2d-array `X` and a 2d-array `Y`
Returns a 2d-array with the distance between each sample in X and each sample in Y of shape (x_samples, y_samples).
pairwise_distance(X[0:2], X[2:4])
array([[5.11514381, 4.85497117],
       [2.35905618, 2.00606825]])
- A 3d-array `X` and a 3d-array `Y`
Returns a 2d-array with the distance between each sample in X and each sample in Y of shape (x_samples, y_samples).
x = X[0:6].reshape(2, 3, -1)
y = X[6:12].reshape(2, 3, -1)
pairwise_distance(x, y, dim="mean")
array([[3.86658733, 3.75622584],
       [5.29468998, 6.06939656]])
If we set the parameter dim="full", a 3d-array of shape (n_dims, x_samples, y_samples) is returned. Read more about multivariate support.
x = X[0:6].reshape(2, 3, -1)
y = X[6:12].reshape(2, 3, -1)
pairwise_distance(x, y, dim="full")
array([[[5.48683192, 6.60301954],
        [4.34083722, 6.35954558]],
       [[2.50507001, 0.90920635],
        [5.27646127, 4.60041068]],
       [[3.60786006, 3.75645164],
        [6.26677146, 7.24823344]]])
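The following sketch (assuming the shapes documented above) checks the return value for each input configuration:
import numpy as np
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import pairwise_distance
X, _ = load_two_lead_ecg()
assert np.ndim(pairwise_distance(X[0], X[2])) == 0          # 1d and 1d: scalar
assert pairwise_distance(X[0], X[2:5]).shape == (3,)        # 1d and 2d: (y_samples,)
assert pairwise_distance(X[0:2]).shape == (2, 2)            # single 2d: (x_samples, x_samples)
assert pairwise_distance(X[0:2], X[2:4]).shape == (2, 2)    # 2d and 2d: (x_samples, y_samples)
x = X[0:6].reshape(2, 3, -1)
y = X[6:12].reshape(2, 3, -1)
assert pairwise_distance(x, y, dim="mean").shape == (2, 2)     # 3d and 3d: (x_samples, y_samples)
assert pairwise_distance(x, y, dim="full").shape == (3, 2, 2)  # dim="full": (n_dims, x_samples, y_samples)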
Paired distance#
Sometimes we do not need to compute the distance between every pair of samples, but
only between corresponding _pairs_ of samples. For this purpose, we use
paired_distance, which accepts two arrays X and Y of shapes
(x_samples, x_timestep) and (y_samples, y_timestep), where
x_timestep == y_timestep for non-elastic metrics and where X and Y can be
broadcast to a uniform shape.
from wildboar.distance import paired_distance
paired_distance(X[0:3], X[3:6])
array([4.85497117, 5.23060212, 4.79102936])
paired_distance(X[0], X[3:6]) # Broadcasting
array([4.85497117, 5.96086309, 6.18777928])
Similar to pairwise_distance, we support the metric and
metric_params parameters, accepting the same input:
paired_distance(X[0], X[3:6], metric="wdtw", metric_params={"g": 0.1})
array([0.50816474, 0.3299048 , 0.55193242])
paired_distance
also supports a multitude of input configurations and
the output depends on that configuration:
- A 1d-array `X` and a 1d-array `Y`
Returns a scalar with the distance of X to Y.
- Two arrays that can be broadcast to the same shape
Returns a 1d-array of shape (n_samples, ).
If the arrays are 3d and the parameter dim is set to "full", a 2d-array of shape (n_dims, n_samples) is returned. Refer to multivariate support for additional details.
x = X[0:6].reshape(2, 3, -1)
y = X[6:12].reshape(2, 3, -1)
paired_distance(x, y, dim="full")
array([[5.48683192, 6.35954558],
       [2.50507001, 4.60041068],
       [3.60786006, 7.24823344]])
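As an illustration (not from the original text), paired_distance(X, Y)[i] is the distance between X[i] and Y[i], i.e., the diagonal of the corresponding pairwise distance matrix:
import numpy as np
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import paired_distance, pairwise_distance
X, _ = load_two_lead_ecg()
paired = paired_distance(X[0:3], X[3:6])
full = pairwise_distance(X[0:3], X[3:6])
assert np.allclose(paired, np.diag(full))  # paired distances lie on the diagonal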
Minimum distance#
Another common task is, given a query time series \(q\) and a set of time series \(\mathcal{T}\), to find the \(k\) time series in the set with the smallest distance to \(q\). A naive solution is to compute the distance between the query and every time series and select the \(k\) time series with the lowest distance:
pairwise_distance(X[0], X[6:12]).argmin()
np.int64(4)
Or if we want to return the k closest, we can use argpartition
(here k=4
):
from numpy import argpartition
argpartition(pairwise_distance(X[0:10], X[11:300]), 4)[:, :4]
array([[ 6, 276, 254, 120],
[ 85, 244, 50, 282],
[101, 277, 270, 217],
...,
[204, 101, 274, 109],
[208, 213, 146, 193],
[246, 111, 255, 43]], shape=(10, 4))
Wildboar implements this pattern in the argmin_distance
function. The
function uses early abandoning to avoid computing the full distance for time
series that would never be among the closest.
from wildboar.distance import argmin_distance
argmin_distance(X[0:10], X[11:300], k=4)
array([[120, 254, 276, 6],
[282, 50, 244, 85],
[217, 270, 277, 101],
...,
[109, 274, 101, 204],
[193, 213, 146, 208],
[ 43, 255, 246, 111]], shape=(10, 4))
The outputs of both calls are equivalent, but the latter is roughly three times faster for most metrics.
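The following sketch (an illustration assuming the default Euclidean metric) verifies that both approaches select the same neighbors and compares their runtime on your machine:
import time
from numpy import argpartition
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import argmin_distance, pairwise_distance
X, _ = load_two_lead_ecg()
start = time.perf_counter()
naive = argpartition(pairwise_distance(X[0:10], X[11:300]), 4)[:, :4]
naive_time = time.perf_counter() - start
start = time.perf_counter()
fast = argmin_distance(X[0:10], X[11:300], k=4)
fast_time = time.perf_counter() - start
# The selected neighbors are the same; only their order within each row may differ.
assert all(set(a) == set(b) for a, b in zip(naive, fast))
print(f"argpartition: {naive_time:.4f}s, argmin_distance: {fast_time:.4f}s")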
The argmin_distance function also accepts a lower_bound parameter,
which takes an array of pairwise lower bounds between the time series. The user
selects which lower bounds to include in this array, depending on their needs
and the chosen distance measure. This flexibility allows the user to make the
search as efficient as possible by choosing an appropriate trade-off between
the simplicity and the tightness of the lower bounds.
from wildboar.distance.lb import DtwKeoghLowerBound
lb_keogh = DtwKeoghLowerBound(r=0.1).fit(X[11:300])
argmin_distance(
X[0:10],
X[11:300],
metric="dtw",
metric_params={"r": 0.1},
lower_bound=lb_keogh.transform(X[0:10]),
)
array([[147],
[ 85],
[ 28],
[164],
[ 69],
[ 47],
[288],
[204],
[226],
[246]])
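As a follow-up sketch (assuming the lower bound only prunes candidates and never changes the result, which holds for admissible lower bounds such as LB_Keogh), we can check that supplying lower_bound returns the same indices as the plain call:
import numpy as np
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import argmin_distance
from wildboar.distance.lb import DtwKeoghLowerBound
X, _ = load_two_lead_ecg()
lb_keogh = DtwKeoghLowerBound(r=0.1).fit(X[11:300])
without_lb = argmin_distance(X[0:10], X[11:300], metric="dtw", metric_params={"r": 0.1})
with_lb = argmin_distance(
    X[0:10],
    X[11:300],
    metric="dtw",
    metric_params={"r": 0.1},
    lower_bound=lb_keogh.transform(X[0:10]),
)
assert np.array_equal(without_lb, with_lb)  # same nearest neighbors, fewer full computations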
[Table: the available lower bounds, the metric each applies to, and comments; two of the listed lower bounds require z-normalized time series.]
Multivariate support#
As described, both paired_distance and pairwise_distance support
multivariate time series by computing the "interdimensional" distance between
time series and (by default) reporting the mean (dim="mean"). Optionally, we
can return the full distance matrix by setting dim="full":
x = X[0:6].reshape(2, 3, -1)
y = X[6:12].reshape(2, 3, -1)
pairwise_distance(x, y, dim="full")
array([[[5.48683192, 6.60301954],
[4.34083722, 6.35954558]],
[[2.50507001, 0.90920635],
[5.27646127, 4.60041068]],
[[3.60786006, 3.75645164],
[6.26677146, 7.24823344]]])
By setting dim="full"
, Wildboar returns the full array of distances
between all dimensions. The returned array has the shape (n_dims,
x_samples, y_samples)
. Similarly, we can compute the paired distance:
paired_distance(x, y, dim="full")
array([[5.48683192, 6.35954558],
[2.50507001, 4.60041068],
[3.60786006, 7.24823344]])
Note that the paired_distance
returns an array of shape
(n_dims, n_samples)
.
If we are interested in the distance for a single dimension, we can either slice the full pairwise distance matrix:
d = pairwise_distance(x, y, dim="full")
d[0]
array([[5.48683192, 6.60301954],
[4.34083722, 6.35954558]])
or slice the full paired distance matrix:
p = paired_distance(x, y, dim="full")
p[0]
array([5.48683192, 6.35954558])
If we are only interested in a single dimension, we can set the dim parameter to the dimension we want:
pairwise_distance(x, y, dim=0)
array([[5.48683192, 6.60301954],
[4.34083722, 6.35954558]])
and for paired_distance
:
paired_distance(x, y, dim=0)
array([5.48683192, 6.35954558])
By setting dim to the desired dimension, we avoid computing the
distance for unwanted dimensions.
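As an illustrative sketch (not part of the original text), the dim options relate as follows: dim="mean" averages the per-dimension distances and dim=0 selects a single dimension of the full matrix:
import numpy as np
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import pairwise_distance
X, _ = load_two_lead_ecg()
x = X[0:6].reshape(2, 3, -1)
y = X[6:12].reshape(2, 3, -1)
full = pairwise_distance(x, y, dim="full")  # shape (n_dims, x_samples, y_samples)
assert np.allclose(pairwise_distance(x, y, dim="mean"), full.mean(axis=0))
assert np.allclose(pairwise_distance(x, y, dim=0), full[0])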
Subsequence search#
Wildboar can also identify the (minimum) subsequence distance, i.e.,
\(\min\limits_{t'\in t} d(s, t')\), where \(s\) is a query and \(t\)
is a time series, with \(|s| \le |t|\). Wildboar supports both
pairwise_subsequence_distance and paired_subsequence_distance,
which work similarly to their non-subsequence counterparts, as well as
subsequence_match to identify subsequences with a distance within a
user-specified _threshold_. Since NumPy does not support jagged arrays, both
the paired and pairwise subsequence functions accept subsequences either as a
Python list of 1d numpy arrays or as a 2d numpy array in which all subsequences
have the same number of timesteps.
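To make the definition concrete, the following naive reference sketch (an illustration using the Euclidean metric, passed explicitly) slides the query over the time series and keeps the smallest distance among all windows of length \(|s|\); pairwise_subsequence_distance computes the same quantity far more efficiently:
import numpy as np
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import pairwise_subsequence_distance

def naive_min_subsequence_distance(s, t):
    # Compare the query against every window of length len(s) in t.
    m = len(s)
    return min(np.linalg.norm(t[i:i + m] - s) for i in range(len(t) - m + 1))

X, _ = load_two_lead_ecg()
assert np.isclose(
    naive_min_subsequence_distance(X[0, 30:60], X[6]),
    pairwise_subsequence_distance(X[0, 30:60], X[6], metric="euclidean"),
)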
Pairwise subsequence distance#
Pairwise subsequence distance requires two inputs, the subsequences Y (argument 1) and the time series X (argument 2):
from wildboar.distance import pairwise_subsequence_distance
pairwise_subsequence_distance(X[0, 30:60], X[6:12])
array([1.66371456, 2.11914265, 1.13076667, 1.99043671, 1.73408875,
1.84227457])
We can also pass multiple subsequences with the same number of time steps as a 2d-numpy array:
pairwise_subsequence_distance(X[0:3, 30:60], X[6:12])
array([[1.66371456, 1.2028058 , 1.17725071],
[2.11914265, 0.85972633, 0.81245725],
[1.13076667, 0.85367621, 1.59786396],
[1.99043671, 0.86957415, 1.65083422],
[1.73408875, 0.64041732, 1.38897019],
[1.84227457, 1.33156061, 1.06460666]])
or with different numbers of time steps as a Python list:
pairwise_subsequence_distance([X[0, 30:60], X[1, 0:10]], X[6:12])
array([[1.66371456, 0.56698045],
[2.11914265, 0.99489626],
[1.13076667, 0.6790517 ],
[1.99043671, 0.16754772],
[1.73408875, 0.10973127],
[1.84227457, 0.50583639]])
Since we support multiple input configurations, the return value depends on the inputs (a shape-checking sketch follows the list):
- A 1d-array `Y` and a 1d-array `X`
Returns a scalar with the minimum distance of Y to X.
- A 1d-array `Y` and a 2d or 3d-array `X`
Returns a 1d-array of shape (n_samples, ) with the minimum distance of the subsequence to each sample.
- A 2d-array or list `Y` and a 1d-array `X`
Returns a 1d-array of shape (n_subsequences, ) with the minimum distance of each subsequence to the sample.
- A 2d-array or list `Y` and a 2d or 3d-array `X`
Returns a 2d-array of shape (n_samples, n_subsequences) with the minimum distance of each subsequence to each sample.
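The following sketch (assuming the shapes listed above) checks the return value for each configuration:
import numpy as np
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import pairwise_subsequence_distance
X, _ = load_two_lead_ecg()
assert np.ndim(pairwise_subsequence_distance(X[0, 30:60], X[6])) == 0                     # 1d Y, 1d X
assert pairwise_subsequence_distance(X[0, 30:60], X[6:12]).shape == (6,)                  # 1d Y, 2d X
assert pairwise_subsequence_distance([X[0, 30:60], X[1, 0:10]], X[6]).shape == (2,)       # list Y, 1d X
assert pairwise_subsequence_distance([X[0, 30:60], X[1, 0:10]], X[6:12]).shape == (6, 2)  # list Y, 2d X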
Note
pairwise_subsequence_distance only supports univariate subsequences. As
such, the dim parameter only accepts an int denoting the dimension in which to
compute the subsequence distance. We can use the following code to compute the
distance between the subsequence and each dimension of every sample, with a
consistent output.
import numpy as np

def pairwise_sd_full(y, x):
    # Compute the subsequence distance for each dimension and stack the results.
    return np.stack(
        [pairwise_subsequence_distance(y, x, dim=dim) for dim in range(x.shape[1])],
        axis=0,
    )
x = X[5:14].reshape(3, 3, -1)
pairwise_sd_full(X[0, 30:60], x)
array([[2.21688671, 1.13076667, 1.84227457],
[1.66371456, 1.99043671, 1.83210644],
[2.11914265, 1.73408875, 1.50884094]])
pairwise_sd_full([X[0, 30:60], X[1, 10:20]], x)
array([[[2.21688671, 0.18507116],
[1.13076667, 0.11177626],
[1.84227457, 0.15611733]],
[[1.66371456, 0.21780536],
[1.99043671, 0.13350353],
[1.83210644, 0.09710811]],
[[2.11914265, 0.75114125],
[1.73408875, 0.13489775],
[1.50884094, 0.09806374]]])
We can request the best matching index, i.e., the index where the minimum distance between the subsequence and the time series is identified, by setting return_index to True:
dist, idx = pairwise_subsequence_distance([X[0, 30:60]], X[6:12], return_index=True)
dist
array([1.66371456, 2.11914265, 1.13076667, 1.99043671, 1.73408875,
       1.84227457])
idx
array([28, 30, 28, 34, 30, 28])
For example, we can see that the minimum distance of the subsequence is located at index 28 for the first, third, and last samples.
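As a quick sanity check (a sketch, passing the Euclidean metric explicitly), the distance at the returned index equals the reported minimum distance:
import numpy as np
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import pairwise_subsequence_distance
X, _ = load_two_lead_ecg()
dist, idx = pairwise_subsequence_distance(
    X[0, 30:60], X[6:12], metric="euclidean", return_index=True
)
i = idx[0]  # best matching start position within the first sample, X[6]
assert np.isclose(dist[0], np.linalg.norm(X[6, i:i + 30] - X[0, 30:60]))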
Benchmark#
As a component of the Wildboar test suite, we systematically evaluate the performance of the estimators and, notably, the metrics, which are a crucial element of numerous tasks. All Wildboar metrics, encompassing subsequence and elastic metrics, are implemented in Cython to optimize CPU and memory efficiency. Upon examining the relative performance of the various metrics and their theoretical time and space complexity, it is evident that the elastic metrics are generally two to three orders of magnitude less efficient than the non-elastic metrics.
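A rough way to reproduce the comparison locally (timings vary by machine) is to time a non-elastic metric against an elastic one on the same data:
import timeit
from wildboar.datasets import load_two_lead_ecg
from wildboar.distance import pairwise_distance
X, _ = load_two_lead_ecg()
for metric in ("euclidean", "dtw"):
    t = timeit.timeit(lambda m=metric: pairwise_distance(X[:100], metric=m), number=5)
    print(f"{metric}: {t / 5:.4f} seconds per call")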
Metrics#
| Metric name | Time | Space | Minimum | Maximum | Mean |
|---|---|---|---|---|---|
| | \(O(n)\) | \(O(1)\) | 1 | 1 | 1 |
| | \(O(n)\) | \(O(1)\) | 0.97686 | 0.87686 | 0.98368 |
| | \(O(n)\) | \(O(1)\) | 0.98282 | 1.11131 | 0.98268 |
| | \(O(n)\) | \(O(1)\) | 0.95506 | 1.14157 | 0.96041 |
| | \(O(n)\) | \(O(1)\) | 0.94631 | 0.83231 | 0.92207 |
| | \(O(n)\) | \(O(1)\) | 0.55527 | 0.73285 | 0.55538 |
| | \(O(n)\) | \(O(1)\) | 0.41892 | 0.40887 | 0.42778 |
| | \(O(n^2)\) | \(O(n)\) | 0.00061 | 0.00104 | 0.00064 |
| | \(O(n^2)\) | \(O(n)\) | 0.00054 | 0.00091 | 0.00056 |
| | \(O(n^2)\) | \(O(n)\) | 0.00030 | 0.00052 | 0.00032 |
| | \(O(n^2)\) | \(O(n)\) | 0.00028 | 0.00048 | 0.00029 |
| | \(O(n^2)\) | \(O(n)\) | 0.00028 | 0.00048 | 0.00029 |
| | \(O(n^2)\) | \(O(n)\) | 0.00028 | 0.00048 | 0.00029 |
| | \(O(n^2)\) | \(O(n)\) | 0.00028 | 0.00048 | 0.00029 |
| | \(O(n^2)\) | \(O(n)\) | 0.00028 | 0.00048 | 0.00029 |
| | \(O(n^2)\) | \(O(n)\) | 0.00021 | 0.00035 | 0.00022 |
| | \(O(n^2)\) | \(O(n)\) | 0.00019 | 0.00032 | 0.00019 |
Subsequence metrics#
| Metric name | Time | Space | Minimum | Maximum | Mean |
|---|---|---|---|---|---|
| | \(O(m\times n)\) | \(O(1)\) | 1 | 1 | 1 |
| | \(O(m\times n)\) | \(O(1)\) | 0.82372 | 0.49048 | 0.79202 |
| | \(O(m\times n)\) | \(O(1)\) | 0.66394 | 0.75967 | 0.67113 |
| | \(O(m\times n)\) | \(O(1)\) | 0.56287 | 0.61275 | 0.56196 |
| | \(O(m\times n)\) | \(O(\max(m, n))\) | 0.49453 | 0.59627 | 0.49988 |
| | \(O(m\times n)\) | \(O(1)\) | 0.42765 | 0.58025 | 0.43553 |
| | \(O(m\times n)\) | \(O(1)\) | 0.42761 | 0.58982 | 0.43474 |
| | \(O(m\times n)\) | \(O(1)\) | 0.21051 | 0.33659 | 0.21757 |
| | \(O(m\times n)\) | \(O(1)\) | 0.06851 | 0.10321 | 0.07092 |
| | \(O(m\times n^2)\) | \(O(\max(m, n))\) | 0.00216 | 0.00413 | 0.00226 |
| | \(O(m\times n^2)\) | \(O(\max(m, n))\) | 0.00146 | 0.00278 | 0.00153 |
| | \(O(m\times n^2)\) | \(O(\max(m, n))\) | 0.00102 | 0.00195 | 0.00107 |
| | \(O(m\times n^2)\) | \(O(\max(m, n))\) | 0.00102 | 0.00194 | 0.00106 |
| | \(O(m\times n^2)\) | \(O(\max(m, n))\) | 0.00099 | 0.00190 | 0.00104 |
| | \(O(m\times n^2)\) | \(O(\max(m, n))\) | 0.00099 | 0.00189 | 0.00103 |
| | \(O(m\times n^2)\) | \(O(\max(m, n))\) | 0.00097 | 0.00185 | 0.00101 |
| | \(O(m\times n^2)\) | \(O(\max(m, n))\) | 0.00096 | 0.00182 | 0.00100 |
| | \(O(m\times n^2)\) | \(O(\max(m, n))\) | 0.00095 | 0.00181 | 0.00099 |
| | \(O(m\times n^2)\) | \(O(\max(m, n))\) | 0.00071 | 0.00136 | 0.00074 |
Parallelization#
Note
Wildboar employs parallelization to distribute the computation across multiple samples concurrently. Consequently, performance improvements may not be observed for a small number of extremely short time series. In certain cases, the additional time required to create threads may actually result in increased overall computation time.
Warning
These measurements are from a GitHub CI build server and may not accurately reflect the actual performance gains.
All functions discussed thus far support parallelization. Parallelization can be enabled by
setting the n_jobs parameter to a positive integer corresponding to the number of cores to
utilize, or to -1
to employ all available cores.
import time
start = time.perf_counter()
pairwise_distance(X[:100], metric="dtw", n_jobs=-1)
time.perf_counter() - start
0.09571664700001747
Or we can specify an exact number of cores:
import time
start = time.perf_counter()
pairwise_distance(X[:100], metric="dtw", n_jobs=1)
time.perf_counter() - start
0.12823146200003066
On my laptop, which is a MacBook equipped with an M2 processor, the computation of pairwise distances using the parameter n_jobs=-1 takes approximately 40 milliseconds, while performing the same computation with n_jobs=1 takes approximately 160 milliseconds.