Distance#

Pairwise distance#

A typical workflow involves computing the distance between pairs of time series. In Wildboar, we accomplish this using the {func}`pairwise_distance`-function which computes the distance between each time series in X and optionally Y. For example, we can compute the _Euclidean distance (default if metric=None) between each pair from time series of a single array as:

>>> from wildboar.datasets import load_two_lead_ecg
>>> from wildboar.distance import pairwise_distance
>>> X, y = load_two_lead_ecg()
>>> pairwise_distance(X[:3])
array([[0.        , 3.51158857, 5.11514381],
      [3.51158857, 0.        , 2.35905618],
      [5.11514381, 2.35905618, 0.        ]])

Note

When computing the distance between time series in a single array, one can note that the upper and lower triangles of the distance matrix are mirror images of each other. Wildboar optimizes this and only computes the upper part of the triangle and mirrors it as the lower half, halving the computational cost. As such, it is advised to compute the distance as pairwise_distance(X) and not pairwise_distance(X, X). Wildboar tries to be smart and computes the single triangle if X is Y, i.e., X and Y are the same object. However, it will not detect if (X == Y).all() (e.g., pairwise_distance(X, X.copy())) and will compute the full matrix.

If we pass a second array, then the returned matrix is the pairwise distance between the time series from both X and Y.

>>> pairwise_distance(X[:3], X[3:6])
array([[4.85497117, 5.96086309, 6.18777928],
      [2.00606825, 5.23060212, 4.27419835],
      [1.64445581, 6.38965963, 4.79102936]])

By default, the pairwise_distance function computes the Euclidean distance. We can change the metric by specifying the metric parameter as one of the supported metrics:

>>> pairwise_distance(X[:3], X[3:6], metric="edr")
array([[0.59756098, 0.47560976, 0.64634146],
      [0.08536585, 0.03658537, 0.13414634],
      [0.09756098, 0.25609756, 0.12195122]])

We can also specify optional extra parameters to the metric using the metric_params-parameter:

>>> pairwise_distance(X[:3], X[3:6], metric="twe", metric_params={"stiffness": 0.3})
array([[76.20881199, 73.62554784, 88.5536877 ],
      [27.49142159, 60.56024904, 50.24551102],
      [20.45513015, 81.60658533, 54.06099416]])

We support a multitude of input configurations and the return value of pairwise_distance depends on the input shape(s) of X (with x_samples) and Y (with y_samples) (and the dim parameter):

A single 1d-array `X`

Returns the scalar $0$.

A single 2d-array `X`

Returns a 2d-array with the distance between each sample in X of shape (x_samples, x_samples).

>>> pairwise_distance(X[0:2])
array([[0.        , 3.51158857],
      [3.51158857, 0.        ]])
A 1d-array `X` and a 1d-array `Y`

Returns a scalar with the distance of X to Y

>>> pairwise_distance(X[0], X[2])
5.11514381
A 1d-array `X` and a 2d-array `Y`

Returns a 1d-array of the distance between X and each sample in Y of shape (y_samples, ). If Y is a 1d-array and X is a 2d-array, a 1d-array, of shape (x_samples, ), with the distance between Y and each sample in X is returned.

>>> pairwise_distance(X[0], X[2:5])
array([5.11514381, 4.85497117, 5.96086309])
A 2d-array `X` and a 2d-array `Y`

Returns a 2d-array with the distance between each sample in X and each sample in Y of shape (x_samples, y_samples).

>>> pairwise_distance(X[0:2], X[2:4])
array([[5.11514381, 4.85497117],
      [2.35905618, 2.00606825]])
A 3d-array `X` and a 3d-array `Y`

Returns a 2d-array with the distance between each sample in X and each sample in Y of shape (x_samples, y_samples).

>>> x = X[0:6].reshape(2, 3, -1)
>>> y = X[6:12].reshape(2, 3, -1)
>>> pairwise_distance(x, y)
array([[5.48683192, 6.60301954],
      [4.34083722, 6.35954558]])

Note

If we set the parameter dim="full" we return a 3d-array of shape (n_dims, n_samples, n_timestep). Read more about multivariate support.

Paired distance#

Sometimes we do not need to compute the distance between every sample, but instead the distance between _pairs_ of samples. For this purpose, we use the paired_distance, which accepts two arrays X and Y with shapes (x_samples, x_timestep) and (y_samples, y_timesteps) with x_timestep == y_timestep for non-elastic metrics and X and Y that can be broadcast to a uniform shape.

>>> from wildboar.distance import paired_distance
>>> paired_distance(X[0:3], X[3:6])
array([4.85497117, 5.23060212, 4.79102936])
>>> paired_distance(X[0], X[3:6]) # Broadcasting
array([4.85497117, 5.96086309, 6.18777928])

Similar to pairwise_distance, we support the parameters metric and metric_params accepting the same input:

>>> paired_distance(X[0], X[3:6], metric="wdtw", metric_params={"g": 0.1})
array([0.50816474, 0.3299048 , 0.55193242])

paired_distance also supports a multitude of input configurations and the output depends on that configuration:

A 1d-array `X` and a 1d-array `Y`

Returns a scalar with the distance of X to Y.

Two arrays that can be broadcast to the same shape

Returns a 1d-array of shape (n_samples, ).

Note

If we set the parameter dim=”full” we return a 2d-array of shape (n_dims, n_samples). Refer to more about multivariate support for additional details.

Multivariate support#

As described, both paired_distance and pairwise_distance support multivariate time series by computing the “interdimensional” distance between time series and (by default) reporting the mean (dim=”mean”). Optionally, we can return the full distance matrix by setting dim="full":

>>> x = X[0:6].reshape(2, 3, -1)
>>> y = X[6:12].reshape(2, 3, -1)
>>> pairwise_distance(x, y, dim="full")
array([[[5.48683192, 6.60301954],
        [4.34083722, 6.35954558]],

      [[2.50507001, 0.90920635],
        [5.27646127, 4.60041068]],

      [[3.60786006, 3.75645164],
        [6.26677146, 7.24823344]]])

By setting dim="full", Wildboar returns the full array of distances between all dimensions. The returned array has the shape (n_dims, x_samples, y_samples). Similarly, we can compute the paired distance:

>>> paired_distance(x, y, dim="full")
array([[5.48683192, 6.35954558],
      [2.50507001, 4.60041068],
      [3.60786006, 7.24823344]])

Note that the paired_distance returns an array of shape (n_dims, n_samples).

If we are interested in the distance between a single dimension we can either slice the input data or slice the full distance matrix:

>>> d = pairwise_distance(x, y, dim="full")
>>> d[0]
array([[5.48683192, 6.60301954],
      [4.34083722, 6.35954558]])
>>> p = paired_distance(x, y, dim="full")
>>> p[0]
array([5.48683192, 6.35954558])

If we are only interested in a single dimension, we can set the dim parameter to the dimension we are want:

>>> pairwise_distance(x, y, dim=0)
array([[5.48683192, 6.60301954],
      [4.34083722, 6.35954558]]
>>> paired_distance(x, y, dim=0)
array([5.48683192, 6.35954558])

By setting dim to the desired dimension, we avoid computing the distance between unwanted dimensions.

Benchmark#

As part of the Wildboar test suite, we continuously investigate the performance of the estimators and, in particular since they are an integrate part of many tasks, the metrics. All Wildboar metrics, including subsequence and elastic metrics, are implemented in Cython for minimum CPU and memory utilization. Inspecting the relative performance of the different metrics and their theoretical time and space-complexity, we can see that the elastic metrics are typically two to three orders of magnitudes slower than the non-elastic metrics.

Metrics#

Metric name

Time

Space

Minimum

Maximum

Mean

chebyshev

\(O(n)\)

\(O(1)\)

1

1

1

angular

\(O(n)\)

\(O(1)\)

0.97686

0.87686

0.98368

cosine

\(O(n)\)

\(O(1)\)

0.98282

1.11131

0.98268

manhattan

\(O(n)\)

\(O(1)\)

0.95506

1.14157

0.96041

euclidean

\(O(n)\)

\(O(1)\)

0.94631

0.83231

0.92207

normalized_euclidean

\(O(n)\)

\(O(1)\)

0.55527

0.73285

0.55538

minkowski

\(O(n)\)

\(O(1)\)

0.41892

0.40887

0.42778

lcss

\(O(n^2)\)

\(O(n)\)

0.00061

0.00104

0.00064

wlcss

\(O(n^2)\)

\(O(n)\)

0.00054

0.00091

0.00056

edr

\(O(n^2)\)

\(O(n)\)

0.00030

0.00052

0.00032

ddtw

\(O(n^2)\)

\(O(n)\)

0.00028

0.00048

0.00029

dtw

\(O(n^2)\)

\(O(n)\)

0.00028

0.00048

0.00029

wddtw

\(O(n^2)\)

\(O(n)\)

0.00028

0.00048

0.00029

wdtw

\(O(n^2)\)

\(O(n)\)

0.00028

0.00048

0.00029

erp

\(O(n^2)\)

\(O(n)\)

0.00028

0.00048

0.00029

twe

\(O(n^2)\)

\(O(n)\)

0.00021

0.00035

0.00022

msm

\(O(n^2)\)

\(O(n)\)

0.00019

0.00032

0.00019

Subsequence metrics#

Metric name

Time

Space

Minimum

Maximum

Mean

chebyshev

\(O(m\times n)\)

\(O(1)\)

1

1

1

euclidean

\(O(m\times n)\)

\(O(1)\)

0.82372

0.49048

0.79202

manhattan

\(O(m\times n)\)

\(O(1)\)

0.66394

0.75967

0.67113

mass

\(O(m\times n)\)

\(O(1)\)

0.56287

0.61275

0.56196

scaled_euclidean

\(O(m\times n)\)

\(O(\text{max}(m; n))\)

0.49453

0.59627

0.49988

cosine

\(O(m\times n)\)

\(O(1)\)

0.42765

0.58025

0.43553

angular

\(O(m\times n)\)

\(O(1)\)

0.42761

0.58982

0.43474

minkowski

\(O(m\times n)\)

\(O(1)\)

0.21051

0.33659

0.21757

normalized_euclidean

\(O(m\times n)\)

\(O(1)\)

0.06851

0.10321

0.07092

lcss

\(O(m\times n^2)\)

\(O(\text{max}(m; n))\)

0.00216

0.00413

0.00226

scaled_dtw [1]

\(O(m\times n^2)\)

\(O(\text{max}(m; n))\)

0.00146

0.00278

0.00153

edr

\(O(m\times n^2)\)

\(O(\text{max}(m; n))\)

0.00102

0.00195

0.00107

ddtw

\(O(m\times n^2)\)

\(O(\text{max}(m; n))\)

0.00102

0.00194

0.00106

wddtw

\(O(m\times n^2)\)

\(O(\text{max}(m; n))\)

0.00099

0.00190

0.00104

dtw

\(O(m\times n^2)\)

\(O(\text{max}(m; n))\)

0.00099

0.00189

0.00103

wdtw

\(O(m\times n^2)\)

\(O(\text{max}(m; n))\)

0.00097

0.00185

0.00101

msm

\(O(m\times n^2)\)

\(O(\text{max}(m; n))\)

0.00096

0.00182

0.00100

erp

\(O(m\times n^2)\)

\(O(\text{max}(m; n))\)

0.00095

0.00181

0.00099

twe

\(O(m\times n^2)\)

\(O(\text{max}(m; n))\)

0.00071

0.00136

0.00074