2025-09-17: Classic Machine Learning Models and XAI Methods

  


Figure 1 in Dwivedi et al.

Traditional AI models are often viewed as “black boxes” whose decision-making processes are not revealed to humans, leading to a lack of trust and reliability. Explainable artificial intelligence (XAI) is a set of methods humans can use to understand the reasoning behind a model's decisions or predictions; typical methods include SHAP, LIME, and permutation importance (PI). We will start from a basic model and then discuss these XAI methods.


Logistic regression

Logistic regression (LR) is a classification model, particularly for problems with a binary outcome. It maps the output of linear regression to the interval [0, 1] through the sigmoid function, indicating the probability for a particular class. 
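As a minimal sketch of this mapping (using scikit-learn and a synthetic dataset, both of which are illustrative assumptions rather than anything from this post):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data, purely for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)

# Reproduce predict_proba by hand for the first sample:
# linear score z = w.x + b, then the sigmoid maps z into [0, 1]
z = X[0] @ model.coef_[0] + model.intercept_[0]
p = 1.0 / (1.0 + np.exp(-z))                 # probability of the positive class
print(p, model.predict_proba(X[:1])[0, 1])   # the two numbers should match
```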

The logistic regression model is simple and highly interpretable. The coefficient of each feature intuitively reflects the impact of that feature, which is easy to understand and explain. A positive weight indicates that the feature is positively related to the positive class, and a negative weight indicates a negative relationship. LR outputs the classification result as well as the probability of that result. All of this shows that LR is an inherently interpretable model.

Logistic regression assumes that the features are linearly related to the log odds, ln(p/(1-p)), where p is the probability of an event occurring. This can be extended into a method for explainability: the odds ratio (OR). Odds ratios are not formally considered part of the XAI toolkit since they only work for LR, but they are practical and widely used in medical research and other fields.

An odds ratio is calculated by dividing the odds of an event occurring in one group by the odds of the event occurring in another group. For example, if the odds ratio for developing lung cancer is 81 for smokers compared to non-smokers, it means the odds of developing lung cancer are 81 times higher for smokers. The OR value is calculated by exponentiating the regression coefficient. For example, if the coefficient of a feature in the logistic regression model is 0.5, the OR value is:

OR = e^0.5 ≈ 1.6487

This means that for every unit increase in the feature, the odds of the event occurring increase by approximately 64.87%.
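A tiny sketch of that calculation (the 0.5 coefficient comes from the example above; the note about pulling coefficients from a fitted model is an assumption for illustration):

```python
import numpy as np

coef = 0.5                               # coefficient from the example above
odds_ratio = np.exp(coef)                # OR = e^0.5 ≈ 1.6487
print(f"OR = {odds_ratio:.4f}")
print(f"odds change per unit increase: {(odds_ratio - 1) * 100:.2f}%")

# For a fitted scikit-learn LogisticRegression `model`, the per-feature
# odds ratios would be: np.exp(model.coef_[0])
```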

The OR value can be used to find features with the highest influence on prediction results, and further used for feature selection or optimization. However, it only works for the LR model. In the rest of the blog, we will talk about other machine learning models and model-agnostic XAI methods.

We will discuss three machine learning models, each representing a distinct approach based on probability, tree models, and spatial distance, respectively.


Machine learning model 1: Naive Bayes

Naive Bayes is a classification model based on probability and Bayes' theorem. It assumes that features are independent of each other, which is not always true in reality. This "naive" assumption simplifies the problem but can potentially reduce the accuracy. 

Naive Bayes obtains the probability of each class and then selects the class with the highest probability as the output. It calculates the posterior probability from the prior probability and the conditional probability via Bayes' theorem: P(class | features) ∝ P(class) · P(features | class). For example, suppose the probability of the word 'win' appearing in spam emails is 80%, but only 10% in regular emails. Given an email containing 'win', we can then calculate the probabilities of 'is spam email' and 'is regular email' and pick the class with the higher probability.
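A rough sketch of this spam example (using scikit-learn's MultinomialNB on a tiny made-up corpus; the emails, labels, and word counts are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus: 1 = spam, 0 = regular
emails = [
    "win a free prize now",
    "win money instantly",
    "meeting agenda for tomorrow",
    "project status update",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)          # word-count features

model = MultinomialNB().fit(X, labels)

# Posterior probabilities for a new email containing the word "win"
new_email = vectorizer.transform(["did we win the contract"])
print(model.predict_proba(new_email))         # [P(regular), P(spam)]
print(model.predict(new_email))               # class with the higher posterior
```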

Naive Bayes is insensitive to missing data, so it can still work effectively when there are missing values or when features are incomplete. It performs well on high-dimensional data thanks to the independence assumption. However, it also has disadvantages: it is sensitive to the input data distribution, and performance may decrease if the data does not follow the assumed distribution (for example, Gaussian for continuous features).


Machine learning model 2: random forest

A decision tree is a learning algorithm with a tree-like structure for making predictions. A random forest uses bagging over decision trees to make predictions. It randomly draws samples (with replacement) from the training set to train each decision tree. When each decision tree node splits, it randomly selects a subset of features and chooses the best split among them. These steps are repeated to build multiple decision trees, which together form the random forest.
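A minimal sketch of these steps with scikit-learn (the dataset and hyperparameters are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrap sample of the training set, and each
# split considers only a random subset of features (max_features)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```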

By integrating multiple decision trees, a random forest achieves better performance than a single decision tree. It can reduce overfitting with random sampling and random feature selection. It is insensitive to missing values and outliers and can handle high-dimensional data. But compared with a single decision tree, the training time is longer. In addition, random forests rely on large amounts of data and may not perform well with small datasets.


Machine learning model 3: SVM (support vector machine)

The core of SVM is to find the hyperplane that best separates data points into different classes while maximizing the margin between them. It can be used for both classification and regression tasks.

SVM has good performance with high-dimensional sparse data (such as text data), as well as nonlinear classification problems, so it is particularly suitable for text classification and image recognition. In addition, SVMs are relatively robust against overfitting. Overall, SVM is a good choice for high-dimensional data with a small number of samples, but for large-scale datasets, SVM training takes a long time, so it is not a good choice there.
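A brief sketch of an SVM classifier (scikit-learn's SVC with an RBF kernel on synthetic data; the data, kernel choice, and parameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling matters for SVMs; the RBF kernel handles nonlinear boundaries
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```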


Having discussed these representative machine learning models, we now turn to model-agnostic XAI methods, which can be applied to any machine learning model, including linear models, tree models, neural networks, and more.


XAI method 1: SHAP

SHAP (Shapley Additive Explanations) is a model interpretation method based on cooperative game theory. Shapley values are calculated to quantify the importance of each feature by evaluating its marginal contribution to the model's output. SHAP local explanations reveal how specific features contribute to the individual prediction for each sample. SHAP global explanations describe a model's overall behavior across the entire dataset by aggregating the SHAP values of all individual samples. Figure 1 demonstrates the importance ranking of features in a global explanation, where 'Elevation' ranks first among all features. We can further examine each feature in detail with a dependence plot, as shown in Figure 2, which shows the relationship between the target and the feature; this relationship can be linear, monotonic, or more complex. In addition, the SHAP toolkit provides more visualization methods depending on your needs.
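A rough sketch of how this looks with the shap library (assuming a tree-based model and synthetic data; none of this comes from the cited study):

```python
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative model and data (assumptions, not from the cited paper)
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # one value per sample and feature

# Local explanation: per-feature contributions to a single prediction
print(shap_values[0])

# Global explanations: aggregate over all samples
shap.summary_plot(shap_values, X)          # feature ranking, as in Figure 1
shap.dependence_plot(0, shap_values, X)    # dependence plot, as in Figure 2
```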


Figure 1: Ranking of influencing features (Fig. 10 in Zhang et al.)


Figure 2: SHAP dependence plot of annual average rainfall (Fig. 14 in Zhang et al.)


XAI method 2: Permutation importance

Permutation importance (PI) is a global analysis method; it does not focus on local results as SHAP does. It assesses the importance of a feature by measuring how much the model's performance decreases (or increases) when that feature's values are randomly permuted while all other features are kept unchanged. By comparing the permuted performance to the baseline performance, permutation importance provides insight into the relative importance of each feature in the model. The difference from the baseline is the importance value, and it can be positive, negative, or zero. A value of zero means the model performs the same whether the feature is completely shuffled or left intact, so the feature is of low importance. A negative value means the model actually performs better with the feature shuffled, suggesting it may be better not to include the feature at all. Figure 3 shows an example of ranking features by permutation importance.
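A short sketch with scikit-learn's permutation_importance (the model and data are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times and compare the score to the baseline;
# the mean drop in score is that feature's importance
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```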


Figure 3: Ranking of features by permutation importance (from scikit-learn user guide)



XAI method 3: LIME

LIME (Local Interpretable Model-agnostic Explanations) is essentially a method for local analysis. It builds a simple interpretable model (such as a linear model) around the target sample, and the contribution of each feature to the prediction can then be approximated by interpreting the simple model's coefficients. It consists of the following steps (a small sketch follows the list):

  1. Select a sample x to be explained.
  2. Generate perturbed samples x′ near x.
  3. Use the complex model to predict the perturbed samples x′ and get the predictions f(x′).
  4. Use the perturbed samples x′ and the corresponding predictions f(x′) to train a simple interpretable model (such as logistic regression).
  5. Interpret the complex model using the coefficients of the simple model.
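
A minimal sketch of these steps using the lime package (the model, data, and feature names are illustrative assumptions):

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)   # the "complex" model

explainer = LimeTabularExplainer(
    X,
    feature_names=[f"f{i}" for i in range(X.shape[1])],
    class_names=["negative", "positive"],
    mode="classification",
)

# Steps 2-5: LIME perturbs the chosen sample, queries the complex model,
# fits a weighted linear surrogate nearby, and reports its coefficients
x = X[0]                                                   # step 1: the sample to explain
explanation = explainer.explain_instance(x, model.predict_proba, num_features=4)
print(explanation.as_list())                               # (feature condition, local weight) pairs
```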

LIME is generally used in local cases. For example, a bank can use LIME to determine the major factors that contribute to a customer being identified as risky by a machine learning model.


To sum up, we mainly have the following XAI methods: permutation importance, SHAP, LIME and odds ratio. Permutation importance and SHAP can give global explanations based on the whole dataset, while LIME can only provide local explanations based on a particular sample. Permutation importance measures how important a feature is for the model, and SHAP measures the marginal contribution of a feature. The first three methods are model-agnostic, while the odds ratio is only used for logistic regression and gives global explanations. We can choose to use one or more of the most suitable methods in real-life applications.


- Xin
