
On marginal feature attributions of tree-based models

  • First & corresponding author: Khashayar Filom
  • Due to their power and ease of use, tree-based machine learning models, such as random forests and gradient-boosted tree ensembles, have become very popular. To interpret them, local feature attributions based on marginal expectations, e.g. marginal (interventional) Shapley, Owen or Banzhaf values, may be employed. Such methods are true to the model and implementation invariant, i.e. dependent only on the input-output function of the model. We contrast this with the popular TreeSHAP algorithm by presenting two (statistically similar) decision trees that compute the exact same function for which the "path-dependent" TreeSHAP yields different rankings of features, whereas the marginal Shapley values coincide. Furthermore, we discuss how the internal structure of tree-based models may be leveraged to help with computing their marginal feature attributions according to a linear game value. One important observation is that these are simple (piecewise-constant) functions with respect to a certain grid partition of the input space determined by the trained model. Another crucial observation, showcased by experiments with XGBoost, LightGBM and CatBoost libraries, is that only a portion of all features appears in a tree from the ensemble. Thus, the complexity of computing marginal Shapley (or Owen or Banzhaf) feature attributions may be reduced. This remains valid for a broader class of game values which we shall axiomatically characterize. A prime example is the case of CatBoost models where the trees are oblivious (symmetric) and the number of features in each of them is no larger than the depth. We exploit the symmetry to derive an explicit formula, with improved complexity and only in terms of the internal model parameters, for marginal Shapley (and Banzhaf and Owen) values of CatBoost models. This results in a fast, accurate algorithm for estimating these feature attributions.
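    The marginal (interventional) attributions discussed in the abstract can be made concrete with a brute-force sketch (our illustration, not the paper's algorithm): the marginal game sets $v(S) = \mathbb{E}[f(x_S, X_{-S})]$ by averaging over a background sample, and the Shapley value of each feature is the usual weighted average of its marginal contributions. The function, point, and background below are hypothetical.

    ```python
    from itertools import combinations
    from math import factorial

    def marginal_shapley(f, x, background):
        """Brute-force marginal Shapley values of f at the point x.

        The marginal game replaces features outside the coalition S by those
        of a background row and averages: v(S) = E_b[ f(x_S, b_{-S}) ]."""
        n = len(x)

        def v(S):
            return sum(f([x[i] if i in S else b[i] for i in range(n)])
                       for b in background) / len(background)

        phi = [0.0] * n
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in range(n):
                for S in combinations(others, r):
                    w = factorial(r) * factorial(n - r - 1) / factorial(n)
                    phi[i] += w * (v(set(S) | {i}) - v(set(S)))
        return phi

    # For an additive model f(z) = z0 + 2*z1, the marginal Shapley value of
    # feature i reduces to its centered contribution coef_i * (x_i - mean_i).
    phi = marginal_shapley(lambda z: z[0] + 2 * z[1], [3.0, 5.0],
                           [[0.0, 0.0], [2.0, 4.0]])  # → [2.0, 6.0]
    ```

    Since the attributions depend on $f$ only through its input-output behavior, this computation is implementation invariant; the exponential cost in the number of features is what the paper's structural observations reduce.
    
    
    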

    Mathematics Subject Classification: Primary: 68T01, 91A12, 91A80, 05A19; Secondary: 91A68, 91A06, 05C05.

    Figure 1.  The picture for Example 3.1 demonstrating that TreeSHAP [45] can depend on the model make-up. Here, the features $ X_1 $ and $ X_2 $ are supported in the rectangle $ \mathcal{B} = [-1, 1]\times[-1, 1] $ on the right which is partitioned into subrectangles $ R_1 = R_1^{ { - }}\cup R_1^{ { + }} $, $ R_2 $ and $ R_3 $. The decision trees $ T_1 $ and $ T_2 $ on the left compute the same function $ g = c_1\cdot \mathbb{1}_{R_1}+c_2\cdot\mathbb{1}_{R_2}+c_3\cdot\mathbb{1}_{R_3} $; the leaves are colored based on the colors of corresponding subrectangles on the right. Shapley values for various games associated with these trees are computed in Example 3.1. In particular, over each of the subrectangles, the Shapley values arising from the marginal game, or from TreeSHAP, are constant expressions in terms of the probabilities $ {\rm{P}}_\mathbf{X}(R_1^{ { - }}), {\rm{P}}_\mathbf{X}(R_1^{ { + }}), {\rm{P}}_\mathbf{X}(R_2), {\rm{P}}_\mathbf{X}(R_3) $ and leaf values $ c_1, c_2, c_3 $; see Table 2. Although the former Shapley values depend only on $ g $, the latter turn out to be different for $ T_1 $ and $ T_2 $. In (22), these parameters are chosen so that TreeSHAP ranks features $ X_1 $ and $ X_2 $ differently for any input from $ R_2 $, whereas the decision trees compute the same function and are almost indistinguishable in terms of impurity measures

    Figure 2.  The partition determined by a decision tree $ T $ and its completion into a grid are demonstrated. The tree on the left implements a simple function $ g = g(x_1, x_2) $ supported in $ \mathcal{B} = [0, 4]\times[0, 3] $. In the corresponding partition $ \mathscr{P}(T) $ of $ \mathcal{B} $ on the right, $ g $ is constant on the smaller subrectangles each corresponding to the leaf of the same color. Here, the tree is not oblivious and $ \mathscr{P}(T) $ is not a grid. Adding the dotted lines refines $ \mathscr{P}(T) $ to a three by four grid $ \widetilde{\mathscr{P}(T)} $. On each piece of the grid, the associated marginal game (and thus its Shapley values) is almost surely constant with a value which is an expression in terms of probabilities $ {\rm{P}}_{\mathbf{X}}(\tilde{R})\, (\tilde{R}\in \widetilde{\mathscr{P}(T)}) $ (cf. Theorem 3.2); these in general cannot be retrieved from the trained tree unless it is oblivious
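    Since the marginal attributions are piecewise constant on the grid completion, locating the grid cell of an input is a per-feature binary search over the sorted split thresholds. A minimal sketch (the threshold values are illustrative, chosen to give a three-by-four grid as in the figure):

    ```python
    import bisect

    # Per-feature sorted split thresholds collected from the trained tree;
    # three cuts on x1 give 4 intervals, two cuts on x2 give 3 intervals.
    thresholds = [[1.0, 2.0, 3.0],   # splits on x1
                  [1.0, 2.0]]        # splits on x2

    def grid_cell(x):
        """Index of the grid cell containing x: one binary search per
        feature, hence logarithmic cost in the number of thresholds."""
        return tuple(bisect.bisect_right(t, xi) for t, xi in zip(thresholds, x))

    grid_cell((0.5, 0.5))  # → (0, 0)
    grid_cell((3.5, 2.5))  # → (3, 2)
    ```
    
    
    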

    Figure 3.  The picture for Example 3.10 which is concerned with the oblivious decision tree $ T $ of depth three appearing on the left. At each split, we go right if the feature is larger than the threshold and go left otherwise. The leaves can thus be encoded with elements of $ \{0, 1\}^3 $; the same holds for the regions into which the tree, as on the right, partitions the rectangle $ \mathcal{B} = [0, 3]\times[0, 2] $ where the features $ (X_1, X_2) $ are supported. But two of the binary codes, $ 001 $ and $ 011 $, do not amount to any region since the paths from the root to their corresponding leaves encounter conflicting thresholds for feature $ X_1 $. Any other leaf of $ T $ corresponds to the region of the same color on the right
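    The realizability check described in the caption amounts to intersecting, per feature, the half-line constraints accumulated along the root-to-leaf path. A sketch, with illustrative thresholds chosen so that exactly the codes $001$ and $011$ conflict on feature $X_1$ as in the figure:

    ```python
    from itertools import product

    # Oblivious tree: one (feature, threshold) split per level; bit 1 at a
    # level means "feature > threshold".  Levels 1 and 3 both split on x1,
    # so a leaf going left at the root but right at level 3 is empty.
    splits = [(0, 1.0), (1, 1.0), (0, 2.0)]
    box = [(0.0, 3.0), (0.0, 2.0)]  # support of (X1, X2)

    def realizable_leaves():
        """Binary leaf codes whose constraint region is non-empty."""
        leaves = []
        for code in product((0, 1), repeat=len(splits)):
            lo = [l for l, _ in box]
            hi = [h for _, h in box]
            for bit, (f, t) in zip(code, splits):
                if bit:
                    lo[f] = max(lo[f], t)   # go right: feature > t
                else:
                    hi[f] = min(hi[f], t)   # go left: feature <= t
            if all(l < h for l, h in zip(lo, hi)):
                leaves.append("".join(map(str, code)))
        return leaves

    realizable_leaves()  # 6 of the 8 codes; '001' and '011' are excluded
    ```
    
    
    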

    Figure 4.  The execution times for the two steps of Algorithm 3.12 are depicted for CatBoost models of various depths which were trained on synthetic data (49) for our experiment in Section 4.1. The plot on the left illustrates the training and test errors for these models. The one in the middle shows the average time it took to precompute the Shapley values for a tree from the ensemble, on a logarithmic scale. Finally, the last plot captures the on-the-fly computation time for obtaining the Shapley values of 1,000 random data points based on the precomputed tables

    Figure 5.  The execution times are plotted for the interventional TreeSHAP—using CatBoost's built-in method $\texttt{get_feature_importance(shap_calc_type="Regular")}$ and background datasets $ D_* $ of various sizes—and the two steps of our proposed Algorithm 3.12—which requires no background dataset and whose accuracy is dictated by the training set $ D $—once applied to CatBoost models trained for Section 4.1. The plots confirm the complexity analysis outlined in Table 1

    Figure 6.  The figure benchmarks the precomputation time per tree for our algorithm (Step 1 of Algorithm 3.12) against the per-tree execution time for CatBoost's built-in method $\texttt{get_feature_importance(shap_calc_type="Exact", shap_mode="UsePreCalc")}$ in both standard and logarithmic scales

    Figure 7.  An analog of Figure 4 for a generalization of Algorithm 3.12 to the case of marginal Owen values based on formula (119). The plots show Owen values precomputation and computation times for CatBoost ensembles with $\texttt{max_depth}$ varying from $ 4 $ to $ 12 $ trained on synthetic data (49) with a fixed underlying partition of the $ n = 40 $ predictors into $ 8 $ groups of size $ 5 $. The first plot is on a logarithmic scale

    Figure 8.  The picture for Example C.2 demonstrating that eject TreeSHAP [8] can depend on the model make-up. For the two decision trees on the left, the splits are the same but occur in different orders. The trees compute the same function $ g = c_1\cdot\mathbb{1}_{[-1, 0]\times[-1, 0]} +c_2\cdot\mathbb{1}_{[-1, 0]\times[0, 1]} +c_3\cdot\mathbb{1}_{[0, 1]\times[-1, 0]} +c_1\cdot\mathbb{1}_{[0, 1]\times[0, 1]} $ and determine the partition on the right of $ \mathcal{B} = [-1, 1]\times[-1, 1] $, where the features are supported, into four subsquares. Under the assumption that these subsquares are equally probable, the associated eject TreeSHAP games (cf. Definition C.1) are presented in (75). The rankings of features based on the Shapley values of these games are never the same for inputs from the top-left subsquare (unless $ c_2 = c_3 $)

    Figure 9.  The picture illustrating the construction outlined in Appendix G which to a decision tree $ T $ assigns an oblivious decision tree $ {\rm{obl}}(T) $ computing the same function. Here, the features $ X_1 $ and $ X_2 $ are supported in the square $ \mathcal{B} = [0, 3]\times [0, 3] $. On the top, a decision tree $ T $ is demonstrated along with its induced partition $ \mathscr{P}(T) $ of $ \mathcal{B} $ where each piece of $ \mathscr{P}(T) $ has the same color as the corresponding leaf of $ T $. On the bottom, the oblivious decision tree $ {\rm{obl}}(T) $ is shown which, at each level, splits on the feature and threshold appearing at an internal node of $ T $. (Each internal node of $ {\rm{obl}}(T) $ is colored similarly as its corresponding internal node from $ T $.) At each split of $ {\rm{obl}}(T) $, we go right if the feature is larger than the threshold and go left otherwise; hence an encoding of leaves of $ {\rm{obl}}(T) $ with binary codes. The leaves which are colored are realizable (i.e. no conflicting thresholds along the path to them), and are in a bijection with elements of the grid partition $ \widetilde{\mathscr{P}(T)} $ of $ \mathcal{B} $. (For an element of $ \widetilde{\mathscr{P}(T)} $, the colors of the corresponding leaf of $ {\rm{obl}}(T) $ and the element of the coarser partition $ \mathscr{P}(T) $ containing it coincide.)

    Figure 10.  The picture for Example H.1 illustrating a piecewise-linear function $ f $ computed by a ReLU network. The features $ (X_1, X_2) $ are supported in $ \mathcal{B} = [0, 1]\times[0, 1] $ and $ f(x_1, x_2) = (x_1-x_2)^+ $. The Shapley values become non-linear; see (125)
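    To see where the nonlinearity comes from, consider the two-feature marginal Shapley formula under the additional assumption (for illustration only; the distribution used in (125) may differ) that $X_1, X_2$ are independent and uniform on $[0, 1]$. With $f(x_1, x_2) = (x_1 - x_2)^+$ one has $\mathbb{E}f(x_1, X_2) = x_1^2/2$, $\mathbb{E}f(X_1, x_2) = (1-x_2)^2/2$ and $\mathbb{E}f(X_1, X_2) = 1/6$, so

    ```latex
    \varphi_1(\mathbf{x})
    = \tfrac{1}{2}\bigl[\mathbb{E}f(x_1,X_2)-\mathbb{E}f(X_1,X_2)\bigr]
      + \tfrac{1}{2}\bigl[f(x_1,x_2)-\mathbb{E}f(X_1,x_2)\bigr]
    = \tfrac{1}{2}\Bigl(\tfrac{x_1^2}{2}-\tfrac{1}{6}\Bigr)
      + \tfrac{1}{2}\Bigl((x_1-x_2)^+ - \tfrac{(1-x_2)^2}{2}\Bigr),
    ```

    which is quadratic in $(x_1, x_2)$; in contrast to the piecewise-constant attributions of tree-based models, no grid partition makes this function locally constant.
    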

    Table 1.  The complexities of various explanation algorithms are compared for an ensemble $ \mathcal{T} $ of oblivious decision trees where each tree has at most $ \mathcal{L} $ leaves. For an oblivious decision tree, the path-dependent TreeSHAP and the marginal Shapley values are the same for data points ending up at the same leaf; cf. Theorem 3.2. Algorithm 3.12 first precomputes all marginal Shapley values for all leaves of all trees and then saves them as look-up tables. The total time complexity is $ O\left(|\mathcal{T}|\cdot \mathcal{L}^{\log_2 6}\cdot\log(\mathcal{L})\right) $, and the memory required to store them is $ O(|\mathcal{T}|\cdot \mathcal{L}\cdot\log(\mathcal{L})) $. As long as the model is in production, the saved tables can be used to estimate marginal Shapley values for any individual in time $ O(|\mathcal{T}|\cdot\log(\mathcal{L})) $. On the other hand, variants of TreeSHAP do not build look-up tables, and their time complexity for one individual (i.e. one leaf of each tree) is reflected above. As for the accuracy in estimating marginal Shapley values, the path-dependent TreeSHAP in general does not converge to marginal Shapley values (cf. Example 3.1), while the interventional TreeSHAP has small error only for large background datasets $ D_* $. In contrast, Algorithm 3.12 does not require any background dataset and instead utilizes model parameters such as leaf weights which are based on the training set $ D $ (which is typically very large)

    Algorithm Precomputation complexity (per leaf) Computation complexity (for an input explicand) Variance of error is governed by
    Path-dependent TreeSHAP [45] N/A $ O\left(|\mathcal{T}|\cdot \mathcal{L} \cdot \log^2(\mathcal{L})\right) $ N/A
    Interventional TreeSHAP [44] N/A $ O(|\mathcal{T}|\cdot\mathcal{L} \cdot |D_*|) $ $ \frac{1}{|D_*|} $
    Algorithm 3.12 $ O\left(|\mathcal{T}|\cdot \mathcal{L}^{\log_2 3}\cdot\log(\mathcal{L})\right) $ $ O(|\mathcal{T}|\cdot\log(\mathcal{L})) $ $ \frac{1}{|D|} $
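    The precompute-then-lookup scheme summarized in Table 1 can be sketched as follows (illustrative code: `leaf_index` and the table contents are hypothetical stand-ins for Step 1's per-leaf Shapley vectors):

    ```python
    def explain(x, trees, tables):
        """Step 2 of the two-step scheme: one leaf lookup per tree, so each
        explanation costs O(|T| * log(L)) once Step 1 has filled `tables`
        with a precomputed Shapley vector per leaf."""
        n = len(x)
        total = [0.0] * n
        for leaf_index, table in zip(trees, tables):
            phi = table[leaf_index(x)]      # precomputed per-leaf vector
            total = [a + b for a, b in zip(total, phi)]
        return total

    # Toy ensemble of two one-split stumps with hypothetical tables.
    trees = [lambda x: int(x[0] > 1.0), lambda x: int(x[1] > 0.0)]
    tables = [{0: [-0.5, 0.0], 1: [0.5, 0.0]},
              {0: [0.0, -1.0], 1: [0.0, 1.0]}]
    explain([2.0, -3.0], trees, tables)  # → [0.5, -1.0]
    ```

    The additivity of the game values across trees is what makes this per-tree table decomposition possible.
    
    
    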

    Table 2.  The table for Example 3.1 where $ \mathbf{X} = (X_1, X_2) $ are the predictors, and two decision trees $ T_1 $ and $ T_2 $ computing the same function $ g $ are considered as in Figure 1. For different games associated with $ \mathbf{X} $, $ g $, $ T_1 $ and $ T_2 $, the table captures the values of $ 2\Delta\varphi $ over various parts of the input space. Here, $ \Delta\varphi $ is the difference $ \varphi_1-\varphi_2 $ of the Shapley values of the game under consideration (see (18)). It is observed that the first two rows only depend on $ (\mathbf{X}, g) $ while on the last two rows, for TreeSHAP games, $ 2\Delta\varphi $ differs for $ T_1 $ and $ T_2 $. For a choice of parameters such as (22), on $ R_2 $ one has $ \varphi_1 < \varphi_2 $ for $ T_1 $ and $ \varphi_1 > \varphi_2 $ for $ T_2 $; hence inconsistent rankings of features by TreeSHAP

    on $ R_1^{ { - }} $ on $ R_1^{ { + }} $ on $ R_2 $ on $ R_3 $
    $ v^ {CE}(\cdot; \mathbf{X}, g)(\mathbf{x}) $ $ (c_2-c_1)\cdot\alpha(x_1) $ $ (c_3-c_1)\cdot\alpha(x_1) $ $ \begin{matrix}(c_1-c_2)\cdot(1-\alpha(x_1))\\+(c_2-c_3)\cdot\beta(x_2) \end{matrix} $ $ \begin{matrix} (c_1-c_3)\cdot(1-\alpha(x_1))\\+(c_3-c_2)\cdot(1-\beta(x_2)) \end{matrix} $
    $ v^ {ME}(\cdot; \mathbf{X}, g)(\mathbf{x}) $ $ (c_2-c_1)(p_2+p_3) $ $ (c_3-c_1)(p_2+p_3) $ $ \begin{matrix}(c_1-c_2)p_1\\ +(c_2-c_3)(p_1^{ { + }}+p_3) \phantom{a} \end{matrix} $ $ \begin{matrix}(c_1-c_3)p_1\\ +(c_3-c_2)(p_1^{ { - }}+p_2) \phantom{a} \end{matrix} $
    $ v^ {Tree}(\cdot; T_1)(\mathbf{x}) $ $ (c_2-c_1)(\hat{p}_2+\hat{p}_3) $ $ (c_3-c_1)(\hat{p}_2+\hat{p}_3) $ $ \begin{matrix} c_1\hat{p}_1+c_2(\hat{p}_2+\hat{p}_3)\\ -\left(c_2\frac{\hat{p}_2}{\hat{p}_2+\hat{p}_3}+c_3\frac{\hat{p}_3}{\hat{p}_2+\hat{p}_3}\right) \end{matrix} $ $ \begin{matrix} c_1\hat{p}_1+c_3(\hat{p}_2+\hat{p}_3)\\ -\left(c_2\frac{\hat{p}_2}{\hat{p}_2+\hat{p}_3}+c_3\frac{\hat{p}_3}{\hat{p}_2+\hat{p}_3}\right) \end{matrix} $
    $ v^ {Tree}(\cdot; T_2)(\mathbf{x}) $ $ (c_2-c_1)\frac{\hat{p}_2}{\hat{p}_1^{ { - }}+\hat{p}_2} $ $ (c_3-c_1)\frac{\hat{p}_3}{\hat{p}_1^{ { + }}+\hat{p}_3} $ $ \begin{matrix}\left(c_1\frac{\hat{p}_1^{ { - }}}{\hat{p}_1^{ { - }}+\hat{p}_2}+c_2\frac{\hat{p}_2}{\hat{p}_1^{ { - }}+\hat{p}_2}\right) \\-\left(c_2(\hat{p}_1^{ { - }}+\hat{p}_2)+c_3(\hat{p}_1^{ { + }}+\hat{p}_3)\right) \end{matrix} $ $ \begin{matrix} \left(c_1\frac{\hat{p}_1^{ { + }}}{\hat{p}_1^{ { + }}+\hat{p}_3}+c_3\frac{\hat{p}_3}{\hat{p}_1^{ { + }}+\hat{p}_3}\right) \\ -\left(c_2(\hat{p}_1^{ { - }}+\hat{p}_2)+c_3(\hat{p}_1^{ { + }}+\hat{p}_3)\right) \end{matrix} $

    Table 3.  The table for Example 3.10, where the marginal Shapley values are computed for the tree illustrated in Figure 3. All choices in (40) for nested subsets $ Z\subseteq W $ are outlined, along with their ramifications for the second bit of the binary codes and the respective weights (cf. (33)) which appear in the formula

    $ Z $ $ W $ $ \omega^+ $ $ Z $ $ W $ $ \omega^- $
    $ u_2=b_2, \, b_2\neq a_2 $ $ \{1\} $ $ \{1\} $ $ \frac{1}{2} $ $ \varnothing $ $ \varnothing $ $ \frac{1}{2} $
    $ u_2\neq b_2, \, b_2=a_2 $ $ \{1, 2\} $ $ \{1, 2\} $ $ \frac{1}{2} $ $ \{2\} $ $ \{2\} $ $ \frac{1}{2} $
    $ u_2=b_2, \, b_2=a_2 $ $ \{1\} $ $ \{1, 2\} $ $ 1 $ $ \varnothing $ $ \{2\} $ $ 1 $

    Table 4.  Datasets used for experiments in Section 4

    Dataset Task Features Train Validation Test
    Superconductivity [28] Regression 81 12,757 4,253 4,253
    Ailerons [70] Regression 40 5,723 1,431 6,596
    Online News [23] Classification 58 23,786 7,929 7,929
    Higgs [73] Classification 28 10,000,000 500,000 500,000

    Table 5.  The performance over the test set of the CatBoost, LightGBM and XGBoost models trained for regression tasks (top) and for classification tasks (bottom) is presented here. (See Table 4 for the underlying datasets.) Here, $ \mathbf{y}_{\rm{test}} $ denotes the correct target values/labels and $ \mathbf{y}_{\rm{cat}} $, $ \mathbf{y}_{\rm{lgbm}} $, $ \mathbf{y}_{\rm{xgb}} $ are the predicted vectors over the test set (predicted probabilities in the case of classification). Each cell captures a metric between the vectors associated with the corresponding row and column. The cells in teal denote $ R^2 $ scores while those in yellow denote AUC scores (only relevant for the classification tasks—the table on the bottom). In both tables, the predicted vectors $ \mathbf{y}_{\rm{cat}} $, $ \mathbf{y}_{\rm{lgbm}} $, $ \mathbf{y}_{\rm{xgb}} $ are compared based on the $ R^2 $ score


    Table 6.  Various quantities associated with the ensembles trained on datasets from Table 4

            ● $|\mathcal{T}|$: number of trees in the ensemble,
            ● $n$: the total number of features,
            ● $\overline{\left|\mathcal{T}^{(i)}\right|}$: the average number of trees in the ensemble that split on a feature,
            ● $\overline{m(T)}$: the average depth of trees,
            ● $\overline{\ell(T)}$: the average number of leaves,
            ● $\overline{k(T)}$: the average number of distinct features per tree.
    An asterisk indicates that the quantity whose average is under consideration does not vary across the ensemble.
    Dataset model type $ |\mathcal{T}| $ $ \overline{\left|\mathcal{T}^{(i)}\right|} $ $ \overline{\ell(T)} $ $ \overline{m(T)} $ $ \overline{k(T)} $ $ n $
    Superconductivity CatBoost 300 27.70 256$ ^* $ 8$ ^* $ 7.48 81
    Superconductivity LightGBM 300 85.48 31$ ^* $ 11.48 23.08 81
    Superconductivity XGBoost 300 92.69 42.21 6$ ^* $ 25.03 81
    Ailerons CatBoost 50 7.50 126.72 6.98 6.00 40
    Ailerons LightGBM 50 14.00 25$ ^* $ 8.42 11.20 40
    Ailerons XGBoost 40 1.98 3.78 1.85 1.98 40
    Online News CatBoost 167 19.86 128$ ^* $ 7$ ^* $ 6.90 58
    Online News LightGBM 88 32.78 31$ ^* $ 10.41 21.60 58
    Online News XGBoost 88 26.98 27.09 5$ ^* $ 17.78 58
    Higgs CatBoost 1000 251.32 256$ ^* $ 8$ ^* $ 7.04 28
    Higgs LightGBM 1000 636.68 60$ ^* $ 13.18 17.83 28
    Higgs XGBoost 1000 524.93 31.89 5$ ^* $ 14.70 28

    Table 7.  The execution times of an optimized implementation of Algorithm 3.12 once applied to explain CatBoost models from Section 4.2 on an AWS machine with 16 cores and 128GB of memory. The algorithm first precomputes a look-up table for each decision tree. In each case, the total time required to construct all of the tables, both with or without multithreading, is recorded along with the total size of these tables. The last column captures the time it took to compute marginal Shapley values for 100 randomly chosen test instances based on these look-up tables

    Dataset Features Trees Precomputation (no multithreading) Precomputation (with multithreading) Total Size Computation (100 samples)
    Superconductivity 81 300 20.118s 6.197s 10.6MB 0.030s
    Ailerons 40 50 0.600s 0.367s 808KB 0.005s
    Online News 58 167 3.518s 1.888s 3.19MB 0.017s
    Higgs 28 1000 51.140s 19.152s 32.3MB 0.068s

    Table 8.  The analog of Table 7 for computing marginal Owen values of CatBoost ensembles from Section 4.2 with a proprietary code based on formula (119). The code was run on an AWS machine with 16 cores and 128GB of memory. For each dataset, the features were partitioned through a hierarchical clustering process based on a sophisticated measure of dependence developed in [55]

    Dataset Features Trees Partition Size Precomputation (no multithreading) Precomputation (with multithreading) Total Size Computation (100 samples)
    Superconductivity 81 300 11 32.022s 10.138s 10.6MB 0.034s
    Ailerons 40 50 24 2.059s 1.025s 808KB 0.006s
    Online News 58 167 37 12.827s 3.715s 3.19MB 0.015s
    Higgs 28 1000 27 252.005s 61.908s 32.2MB 0.068s
  • [1] K. Aas, M. Jullum and A. Løland, Explaining individual predictions when features are dependent: More accurate approximations to Shapley values, Artificial Intelligence, 298 (2021), Paper No. 103502, 24 pp. doi: 10.1016/j.artint.2021.103502.
    [2] E. Al Daoud, Comparison between XGBoost, lightGBM and CatBoost using a home credit dataset, International Journal of Computer and Information Engineering, 13 (2019), 6-10. 
    [3] D. Alvarez Melis and T. Jaakkola, Towards robust interpretability with self-explaining neural networks, Advances in Neural Information Processing Systems, 31 (2018).
    [4] S. I. Amoukou, T. Salaün and N. Brunel, Accurate Shapley values for explaining tree-based models, in International Conference on Artificial Intelligence and Statistics, PMLR, (2022), 2448-2465.
    [5] R. J. Aumann and J. H. Dréze, Cooperative games with coalition structures, International Journal of Game Theory, 3 (1974), 217-237.  doi: 10.1007/BF01766876.
    [6] J. F. Banzhaf III, Weighted voting doesn't work: A mathematical analysis, Rutgers L. Rev., 19 (1964), 317. 
    [7] L. Breiman, Random forests, Machine Learning, 45 (2001), 5-32.  doi: 10.1023/A:1010933404324.
    [8] T. W. Campbell, H. Roder, R. W. Georgantas III and J. Roder, Exact Shapley values for local and model-true explanations of decision tree ensembles, Machine Learning with Applications, 9 (2022), 100345. doi: 10.1016/j.mlwa.2022.100345.
    [9] R. Caruana and A. Niculescu-Mizil, An empirical comparison of supervised learning algorithms, in Proceedings of the 23rd International Conference on Machine Learning, (2006), 161-168. doi: 10.1145/1143844.1143865.
    [10] H. Chen, I. C. Covert, S. M. Lundberg and S.-I. Lee, Algorithms to estimate Shapley value feature attributions, arXiv e-prints, 2022. arXiv: 2207.07605v1.
    [11] H. Chen, J. D. Janizek, S. Lundberg and S.-I. Lee, True to the model or true to the data?, arXiv e-prints, 2020. arXiv: 2006.16234.
    [12] H. Chen, S. Lundberg and S.-I. Lee, Explaining models by propagating Shapley values of local components, Explainable AI in Healthcare and Medicine: Building a Culture of Transparency and Accountability, 914 (2021), 261-270. doi: 10.1007/978-3-030-53352-6_24.
    [13] T. Chen and C. Guestrin, XGBoost: A scalable tree boosting system, in Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, (2016), 785-794. doi: 10.1145/2939672.2939785.
    [14] A. Chopra and P. Bhilare, Application of ensemble models in credit scoring models, Business Perspectives and Research, 6 (2018), 129-141.  doi: 10.1177/2278533718765531.
    [15] A. Datta, S. Sen, and Y. Zick, Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems, in 2016 IEEE Symposium on Security and Privacy (SP), IEEE, (2016), 598-617. doi: 10.1109/SP.2016.42.
    [16] T. G. Dietterich, Ensemble methods in machine learning, in International Workshop on Multiple Classifier Systems, Springer, 1857 (2000), 1-15. doi: 10.1007/3-540-45014-9_1.
    [17] Documentation, CatBoost. https://catboost.ai/en/docs.
    [18] Documentation, LightGBM. https://lightgbm.readthedocs.io/en/latest/index.html.
    [19] Documentation, TreeSHAP. https://shap-lrjball.readthedocs.io/en/latest/generated/shap.TreeExplainer.html.
    [20] Documentation, XGBoost. https://xgboost.readthedocs.io/en/stable.
    [21] A. Dorogush, V. Ershov and A. Gulin, CatBoost: gradient boosting with categorical features support, arXiv e-prints, 2018. arXiv: 1810.11363.
    [22] ECOA, Equal Credit Opportunity Act. https://www.justice.gov/crt/equal-credit-opportunity-act-3.
    [23] K. Fernandes, P. Vinagre, P. Cortez and P. Sernadela, Online News Popularity, UCI Machine Learning Repository, 2015. doi: 10.24432/C5NS3V.
    [24] M. Ferov and M. Modrý, Enhancing LambdaMART using oblivious trees, arXiv e-prints, 2016. arXiv: 1609.05610.
    [25] FHA, Fair Housing Act. https://www.justice.gov/crt/fair-housing-act-1.
    [26] J. H. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics, 29 (2001), 1189-1232. doi: 10.1214/aos/1013203451.
    [27] L. Grinsztajn, E. Oyallon and G. Varoquaux, Why do tree-based models still outperform deep learning on typical tabular data?, in Thirty-Sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
    [28] K. Hamidieh, Superconductivty Data, UCI Machine Learning Repository, 2018. doi: 10.24432/C53P47.
    [29] J. T. Hancock and T. M. Khoshgoftaar, CatBoost for big data: An interdisciplinary review, Journal of Big Data, 7 (2020), 1-45. 
    [30] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2, Springer, 2009. doi: 10.1007/978-0-387-84858-7.
    [31] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al., Practical lessons from predicting clicks on ads at facebook, in Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, (2014), 1-9. doi: 10.1145/2648584.2648589.
    [32] L. Hu, J. Chen, J. Vaughan, S. Aramideh, H. Yang, K. Wang, A. Sudjianto and V. N. Nair, Supervised machine learning techniques: An overview with applications to banking, International Statistical Review, 89 (2021), 573-604. doi: 10.1111/insr.12448.
    [33] D. Janzing, L. Minorics and P. Blöbaum, Feature relevance quantification in explainable AI: A causal problem, in International Conference on Artificial Intelligence and Statistics, PMLR, (2020), 2907-2916.
    [34] B. John, When to Choose CatBoost Over XGBoost or LightGBM [Practical Guide]?, (neptun.ai), 2022. https://neptune.ai/blog/when-to-choose-catboost-over-xgboost-or-lightgbm.
    [35] M. Jullum, A. Redelmeier and K. Aas, GroupShapley: Efficient prediction explanation with Shapley values for feature groups, arXiv e-prints, 2021. arXiv: 2106.12228.
    [36] U. Kamath and J. Liu, Explainable Artificial Intelligence: An Introduction to Interpretable Machine Learning, Springer, 2021. doi: 10.1007/978-3-030-83356-5.
    [37] Y. Kamijo, A two-step Shapley value for cooperative games with coalition structures, International Game Theory Review, 11 (2009), 207-214.  doi: 10.1142/S0219198909002261.
    [38] A. Karczmarz, T. Michalak, A. Mukherjee, P. Sankowski and P. Wygocki, Improved feature importance computation for tree models based on the Banzhaf value, in The 38th Conference on Uncertainty in Artificial Intelligence, 2022.
    [39] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye and T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, Advances in Neural Information Processing Systems, 30 (2017).
    [40] L. Koralov and Y. Sinai, Theory of Probability and Random Processes, Springer, 2007. doi: 10.1007/978-3-540-68829-7.
    [41] K. Kotsiopoulos, A. Miroshnikov, K. Filom and A. R. Kannan, Approximation of group explainers with coalition structure using Monte Carlo sampling on the product space of coalitions and features, arXiv e-prints, 2023. arXiv: 2303.10216.
    [42] H. Lakkaraju, E. Kamar, R. Caruana and J. Leskovec, Interpretable & explorable approximations of black box models, arXiv e-prints, 2017. arXiv: 1707.01154.
    [43] Y. Lou, R. Caruana, J. Gehrke and G. Hooker, Accurate intelligible models with pairwise interactions, in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2013), 623-631. doi: 10.1145/2487575.2487579.
    [44] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal and S.-I. Lee, From local explanations to global understanding with explainable AI for trees, Nature Machine Intelligence, 2 (2020), 56-67. doi: 10.1038/s42256-019-0138-9.
    [45] S. M. Lundberg, G. G. Erion and S.-I. Lee, Consistent individualized feature attribution for tree ensembles, arXiv e-prints, 2018. arXiv: 1802.03888.
    [46] S. M. Lundberg and S.-I. Lee, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems, 30 (2017).
    [47] L. Merrick and A. Taly, The explanation game: Explaining machine learning models using Shapley values, in International Cross-Domain Conference for Machine Learning and Knowledge Extraction, Springer, 12279 (2020), 17-38. doi: 10.1007/978-3-030-57321-8_2.
    [48] A. Miroshnikov, K. Kotsiopoulos, K. Filom and A. R. Kannan, Stability theory of game-theoretic group feature explanations for machine learning models, arXiv e-prints, 2024. arXiv: 2102.10878v5.
    [49] C. Molnar, Interpretable Machine Learning, Lulu.com, 2020.
    [50] G. F. Montufar, R. Pascanu, K. Cho and Y. Bengio, On the number of linear regions of deep neural networks, Advances in Neural Information Processing Systems, 27 (2014).
    [51] A. Nahon, XGBoost, LightGBM or CatBoost-Which Boosting Algorithm Should I Use?, Medium.com/riskified-technology, 2019. https://medium.com/riskified-technology/xgboost-lightgbm-or-catboost-which-boosting-algorithm-should-i-use-e7fda7bb36bc.
    [52] H. Nori, S. Jenkins, P. Koch and R. Caruana, InterpretML: A unified framework for machine learning interpretability, arXiv e-prints, 2019. arXiv: 1909.09223.
    [53] G. Owen, Values of games with a priori unions, in Mathematical Economics and Game Theory, Springer, 141 (1977), 76-88. doi: 10.1007/978-3-642-45494-3_7.
    [54] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli and J. Sohl-Dickstein, On the expressive power of deep neural networks, in International Conference on Machine Learning, PMLR, (2017), 2847-2854.
    [55] Y. A. Reshef, D. N. Reshef, H. K. Finucane, P. C. Sabeti and M. Mitzenmacher, Measuring dependence powerfully and equitably, J. Mach. Learn. Res., 17 (2016), Paper No. 212, 63 pp.
    [56] M. T. Ribeiro, S. Singh and C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2016), 1135-1144.
    [57] M. T. Ribeiro, S. Singh and C. Guestrin, Anchors: High-precision model-agnostic explanations, in Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. doi: 10.1609/aaai.v32i1.11491.
    [58] B. P. Roe, H.-J. Yang, J. Zhu, Y. Liu, I. Stancu and G. McGregor, Boosted decision trees as an alternative to artificial neural networks for particle identification, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 543 (2005), 577-584. doi: 10.1016/j.nima.2004.12.018.
    [59] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nature Machine Intelligence, 1 (2019), 206-215.  doi: 10.1038/s42256-019-0048-x.
    [60] A. Saabas, Treeinterpreter Python Package, 2019. https://github.com/andosa/treeinterpreter.
    [61] S. Saha, XGBoost vs LightGBM: How Are They Different, Neptun.ai, 2022. https://neptune.ai/blog/xgboost-vs-lightgbm.
    [62] L. S. Shapley, A value for n-person games, Contributions to the Theory of Games, 2 (1953), 307-317.  doi: 10.1515/9781400881970-018.
    [63] Y. Shuo Tan, C. Singh, K. Nasseri, A. Agarwal and B. Yu, Fast interpretable greedy-tree sums (FIGS), arXiv e-prints, 2022. arXiv: 2201.11931.
    [64] R. Shwartz-Ziv and A. Armon, Tabular data: Deep learning is not all you need, Information Fusion, 81 (2022), 84-90.  doi: 10.1016/j.inffus.2021.11.011.
    [65] E. Štrumbelj and I. Kononenko, Explaining prediction models and individual predictions with feature contributions, Knowledge and Information Systems, 41 (2014), 647-665.  doi: 10.1007/s10115-013-0679-x.
    [66] A. Sudjianto, W. Knauth, R. Singh, Z. Yang and A. Zhang, Unwrapping the black box of deep ReLU networks: Interpretability, diagnostics, and simplification, arXiv e-prints, 2020. arXiv: 2011.04041.
    [67] A. Sudjianto and A. Zhang, Designing inherently interpretable machine learning models, arXiv e-prints, 2021. arXiv: 2111.01743.
    [68] M. Sundararajan and A. Najmi, The many Shapley values for model explanation, in International Conference on Machine Learning, PMLR, (2020), 9269-9278.
    [69] M. Sundararajan, A. Taly and Q. Yan, Axiomatic attribution for deep networks, in International Conference on Machine Learning, PMLR, (2017), 3319-3328.
    [70] L. Torgo and R. Camacho, Ailerons Data, OpenML Data Repository, 2014. https://www.openml.org/search?type=data&status=active&id=296.
    [71] L. Turgeman and J. H. May, A mixed-ensemble model for hospital readmission, Artificial Intelligence in Medicine, 72 (2016), 72-82.  doi: 10.1016/j.artmed.2016.08.005.
    [72] J. Vaughan, A. Sudjianto, E. Brahimi, J. Chen and V. N. Nair, Explainable neural networks based on additive index models, arXiv e-prints, 2018. arXiv: 1806.01933.
    [73] D. Whiteson, HIGGS, UCI Machine Learning Repository, 2014. doi: 10.24432/C5V312.
    [74] Q. Wu, C. J. Burges, K. M. Svore and J. Gao, Adapting boosting for information retrieval measures, Information Retrieval, 13 (2010), 254-270. doi: 10.1007/s10791-009-9112-1.
    [75] J. Yang, Fast TreeSHAP: Accelerating SHAP value computation for trees, arXiv e-prints, 2021. arXiv: 2109.09847.
    [76] Z. Yang, A. Zhang and A. Sudjianto, Enhancing explainability of neural networks through architecture constraints, IEEE Transactions on Neural Networks and Learning Systems, 32 (2021), 2610-2621. doi: 10.1109/TNNLS.2020.3007259.
    [77] Z. Yang, A. Zhang and A. Sudjianto, GAMI-Net: An explainable neural network based on generalized additive models with structured interactions, Pattern Recognition, 120 (2021), 108192. doi: 10.1016/j.patcog.2021.108192.
    [78] H. P. Young, Monotonic solutions of cooperative games, International Journal of Game Theory, 14 (1985), 65-72.  doi: 10.1007/BF01769885.
    [79] Z. Zhang, Y. Zhao, A. Canes, D. Steinberg, O. Lyashevska, et al., Predictive analytics with gradient boosting in clinical medicine, Annals of Translational Medicine, 7 (2019). doi: 10.21037/atm.2019.03.29.
    [80] W. Zhao, R. Singh, T. Joshi, A. Sudjianto and V. N. Nair, Self-interpretable convolutional neural networks for text classification, arXiv e-prints, 2021. arXiv: 2105.08589.
Open Access Under a Creative Commons license
