.. currentmodule:: wildboar

#####################
Tree-based estimators
#####################

*****************************
Shapelet-based decision trees
*****************************

For classification purposes, Wildboar provides two variants of shapelet
trees. Both types of trees utilize random sampling of shapelets. The
:class:`tree.ShapeletTreeClassifier` selects shapelets at random, whereas the
:class:`tree.ExtraShapeletTreeClassifier` additionally samples the distance
threshold at random. Correspondingly, both :class:`tree.ShapeletTreeRegressor`
and :class:`tree.ExtraShapeletTreeRegressor` are implemented for regression
tasks.

.. execute::
   :context:
   :show-return:

   from wildboar.datasets import load_gun_point
   from wildboar.tree import ShapeletTreeClassifier

   X_train, X_test, y_train, y_test = load_gun_point(merge_train_test=False)
   clf = ShapeletTreeClassifier(random_state=1)
   clf.fit(X_train, y_train)

Since the trees are built using a (small) random sample of shapelets, the
predictive performance of a single tree is poor. The model above has an
accuracy of :execute:`"%.1f" % (clf.score(X_test, y_test) * 100)`\%. We can
increase the accuracy by increasing the size of the random sample:

.. execute::
   :context:
   :show-return:

   clf = ShapeletTreeClassifier(n_shapelets=100, random_state=1)
   clf.fit(X_train, y_train)

The new estimator has an accuracy of
:execute:`"%.1f" % (clf.score(X_test, y_test) * 100)`\%.

*****************************
Interval-based decision trees
*****************************

Interval-based decision trees use values computed over contiguous intervals.
For example, we can construct a decision tree that uses the mean of each
interval:

.. execute::
   :context:
   :show-return:

   from wildboar.tree import IntervalTreeClassifier

   clf = IntervalTreeClassifier(summarizer="mean", random_state=1)
   clf.fit(X_train, y_train)

The classifier chooses the interval whose `mean` value optimizes the impurity
criterion. The resulting estimator has an accuracy of
:execute:`"%.1f" % (clf.score(X_test, y_test) * 100)`\%.

We can also build the tree using the `mean`, `variance` and `slope` of the
intervals:

.. execute::
   :context:
   :show-return:

   clf = IntervalTreeClassifier(summarizer="mean_var_slope", random_state=1)
   clf.fit(X_train, y_train)

The resulting estimator has an accuracy of
:execute:`"%.1f" % (clf.score(X_test, y_test) * 100)`\%.

We can also use the `catch22` features:

.. execute::
   :context:
   :show-return:

   clf = IntervalTreeClassifier(summarizer="catch22", random_state=1)
   clf.fit(X_train, y_train)

The resulting estimator has an accuracy of
:execute:`"%.1f" % (clf.score(X_test, y_test) * 100)`\%.

.. warning::

   If we compute more than one attribute value per interval, each node is
   split using a randomly drawn attribute value. Such trees are more suitable
   in ensembles than as single trees.

One can also specify a list of `callable` functions that compute the
attribute value for an interval:

.. execute::
   :context:
   :show-return:

   import numpy as np

   clf = IntervalTreeClassifier(summarizer=[np.mean], random_state=1)
   clf.fit(X_train, y_train)

The resulting estimator has an accuracy of
:execute:`"%.1f" % (clf.score(X_test, y_test) * 100)`\%.

.. warning::

   Wildboar trees are able to release the global interpreter lock unless a
   callable is passed. Using callable summarizers results in degraded
   performance.
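
As the warning above notes, trees that split on a randomly drawn attribute
value are better suited for ensembles than for use as single estimators.
Since Wildboar estimators follow the scikit-learn API, one possible way to
build such an ensemble is scikit-learn's bagging meta-estimator. The
following is a minimal sketch, assuming scikit-learn >= 1.2 (for the
``estimator`` keyword); the choice of 100 trees is arbitrary:

.. code-block:: python

   from sklearn.ensemble import BaggingClassifier

   from wildboar.tree import IntervalTreeClassifier

   # Bag 100 interval trees; each tree is fit on a bootstrap sample and
   # splits on randomly drawn attribute values, so averaging the trees
   # reduces the variance of the individual estimators.
   bag = BaggingClassifier(
       estimator=IntervalTreeClassifier(summarizer="mean_var_slope"),
       n_estimators=100,
       random_state=1,
   )
   bag.fit(X_train, y_train)
   bag.score(X_test, y_test)  # accuracy on the hold-out split

Wildboar also provides purpose-built ensemble estimators in
:mod:`wildboar.ensemble`.
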
***********************
Decision tree structure
***********************

All decision tree estimators in Wildboar have an attribute ``tree_`` that
contains the low-level representation of the tree. The tree is stored as
several parallel arrays that describe the entire structure of the decision
tree; for example, ``tree_.max_depth`` records the maximal depth of the
decision tree and ``tree_.node_count`` stores the total number of branch and
leaf nodes.

The arrays are indexed from ``[0, tree_.node_count)``, where ``0`` indicates
the root node of the tree and the other indices refer to either a leaf or a
branch node. The same arrays are used for both branches and leaves, meaning
that some values are undefined for leaves or branches depending on their use.
For example, ``tree_.value`` records the values of leaf nodes, so the
corresponding value is undefined if the node is a branch. Similarly, the
``tree_.threshold`` array records the threshold used to partition the samples
and is undefined if the node is a leaf.

The following arrays are available:

- ``left``: the id of the left child, e.g., ``left[0]`` corresponds to the
  left child of the root. The value is ``-1`` if the node is a leaf.
- ``right``: the id of the right child, e.g., ``right[0]`` corresponds to the
  right child of the root. The value is ``-1`` if the node is a leaf.
- ``threshold``: the threshold at a given node. The value is undefined for
  leaf nodes.
- ``attribute``: the attribute at a given node. The value is ``None`` for
  leaf nodes and a tuple ``(dim, attr)`` for branch nodes. The ``attr`` value
  depends on the decision tree.
- ``value``: the output value for the node. The array has the shape
  ``(node_count, n_labels)``, with ``n_labels == 1`` for regression. The j:th
  column refers to the j:th class as present in the ``classes_`` attribute.
  The value is undefined for branch nodes.
- ``impurity``: the impurity at the node.
- ``n_node_samples``: the number of samples reaching the node.
- ``weighted_n_node_samples``: the number of *weighted* samples reaching the
  node. The same as ``n_node_samples`` unless ``sample_weight`` or
  ``class_weight`` is set.

We can compute various properties of the tree using these arrays. For
example, we can obtain the decision path, that is, the nodes a sample
traverses, or the terminal leaf reached by each sample. Both are already
provided by Wildboar as the methods
:meth:`~wildboar.tree.ShapeletTreeClassifier.decision_path` and
:meth:`~wildboar.tree.ShapeletTreeClassifier.apply`.

First we construct a decision tree:

.. execute::
   :context:
   :show-return:

   clf = ShapeletTreeClassifier(
       n_shapelets=10,
       impurity_equality_tolerance=0.0,
       max_shapelet_size=0.4,
       metric="scaled_euclidean",
       random_state=1,
   )
   clf.fit(X_train, y_train)

The decision tree has an accuracy of
:execute:`"%.1f" % (clf.score(X_test, y_test) * 100)`\%.
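
Before tracing individual predictions, we can use the parallel arrays to
derive properties that are not exposed directly. The following minimal
sketch, modeled after the corresponding scikit-learn recipe, computes the
depth of every node and marks the leaves, using only ``tree_.left``,
``tree_.right`` and ``tree_.node_count`` as described above:

.. code-block:: python

   import numpy as np

   tree = clf.tree_
   node_depth = np.zeros(tree.node_count, dtype=np.intp)
   is_leaf = np.zeros(tree.node_count, dtype=bool)

   # Walk the tree from the root (node 0), carrying the depth along.
   stack = [(0, 0)]  # (node_id, depth)
   while stack:
       node_id, depth = stack.pop()
       node_depth[node_id] = depth
       if tree.left[node_id] == -1:  # -1 marks a leaf
           is_leaf[node_id] = True
       else:
           stack.append((tree.left[node_id], depth + 1))
           stack.append((tree.right[node_id], depth + 1))

   print("max depth:", node_depth.max())  # should equal tree.max_depth
   print("number of leaves:", is_leaf.sum())
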
Next, the following code sample shows the decision rules used to reach a
prediction for a sample:

.. execute::
   :context:
   :show-output: code

   from wildboar.distance import pairwise_subsequence_distance

   node_indicator = clf.decision_path(X_test)
   leaf_id = clf.apply(X_test)
   attribute = clf.tree_.attribute
   threshold = clf.tree_.threshold
   value = clf.tree_.value

   sample_id = 100

   # obtain the ids of the nodes `sample_id` goes through,
   # i.e., row `sample_id` of the node indicator matrix
   node_index = node_indicator.indices[
       node_indicator.indptr[sample_id] : node_indicator.indptr[sample_id + 1]
   ]

   print("Rules used to predict sample {id}:\n".format(id=sample_id))
   for node_id in node_index:
       # the terminal leaf has no decision rule; skip it
       if leaf_id[sample_id] == node_id:
           continue

       # check if the distance between the sample and the shapelet of
       # the node is below the threshold
       (dim, (_, shapelet)) = attribute[node_id]
       dist = pairwise_subsequence_distance(
           shapelet, X_test[sample_id], metric="scaled_euclidean"
       )
       if dist <= threshold[node_id]:
           threshold_sign = "<="
       else:
           threshold_sign = ">"

       print(
           f"decision node {node_id} : (D(X_test[{sample_id}], S[{node_id}]) = {dist}) "
           f"{threshold_sign} {threshold[node_id]}"
       )

Plotting
========

We can plot shapelet-based decision trees using the
:func:`~wildboar.tree.plot_tree` function.

.. execute::
   :context:

   from wildboar.tree import plot_tree

   plot_tree(clf)

In the decision tree, we observe three branching nodes and four terminal
nodes. Each node displays the count of samples that have reached it. For
example, the initial branching node is reached by the entire training set
(50 samples), while the leftmost terminal node is reached by 22 samples. The
branching nodes define the decision criteria that determine the path to
follow: the left path is taken if the distance between the sample and the
shapelet is less than the threshold, and the right path is taken if the
distance exceeds it. This criterion is represented in the node by the
inequality `D[X, S[0]] <= 0.91`, which signifies that if the distance between
the sample and the first shapelet is less than `0.91`, the left path is
chosen (conversely, the right path is chosen if the distance is greater).
Upon reaching a terminal node, the prediction, denoted by `Pred: y`, is
assigned to the sample.

We can also visualize the shapelets together with the tree:

.. execute::
   :context:
   :include-source: no
   :show-source-link:

   import numpy as np
   import matplotlib.pyplot as plt

   fig = plt.figure(figsize=(8, 4))
   arrays = [
       (i, attr[1][1])
       for i, attr in enumerate(clf.tree_.attribute)
       if attr is not None
   ]
   spec = fig.add_gridspec(len(arrays), 5)
   left = fig.add_subplot(spec[:, :4])
   plot_tree(clf, ax=left, fontsize=None)

   cmap = plt.get_cmap(lut=len(arrays))
   for i, (index, arr) in enumerate(arrays):
       ax = fig.add_subplot(spec[i, 4])
       ax.plot(arr, label=f"S[{index}]", color=cmap(i))
       ax.set_xlim([len(arr) - X_test.shape[1], X_test.shape[1]])
       ax.set_ylim([np.min(X_train), np.max(X_train)])
       ax.set_axis_off()

   fig.legend()
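
The sample counts displayed in the plot refer to the training samples. To see
how the test samples distribute over the terminal nodes, we can combine
:meth:`~wildboar.tree.ShapeletTreeClassifier.apply` from the previous section
with the ``tree_.value`` array; a minimal sketch:

.. code-block:: python

   import numpy as np

   # The terminal leaf reached by each test sample.
   leaf_id = clf.apply(X_test)

   # Count how many test samples end up in each terminal node and show
   # the output value recorded for that leaf during training.
   leaves, counts = np.unique(leaf_id, return_counts=True)
   for leaf, count in zip(leaves, counts):
       print(f"leaf {leaf}: {count} test samples, value={clf.tree_.value[leaf]}")
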