| Title: | Visualization of BART and BARP using SHAP |
|---|---|
| Description: | Complex machine learning models are often difficult to interpret. Shapley values are a powerful tool for explaining why a model makes a particular prediction. This package computes variable contributions using permutation-based Shapley values for Bayesian Additive Regression Trees (BART) and its extension with Post-Stratification (BARP). The permutation-based SHAP method proposed by Štrumbelj and Kononenko (2014) <doi:10.1007/s10115-013-0679-x> is applied to data obtained via MCMC sampling. Like the BART model introduced by Chipman, George, and McCulloch (2010) <doi:10.1214/09-AOAS285>, this package uses the Bayesian posterior samples generated during model estimation, so variable contributions can be computed without additional sampling. BART models fitted with the following R packages are supported: 'BART' <doi:10.18637/jss.v097.i01>, 'bartMachine' <doi:10.18637/jss.v070.i04>, and 'dbarts' <https://CRAN.R-project.org/package=dbarts>. For XGBoost and baseline adjustments, the approach by Lundberg et al. (2020) <doi:10.1038/s42256-019-0138-9> is also considered. The BARP model proposed by Bisbee (2019) <doi:10.1017/S0003055419000480> was implemented with reference to <https://github.com/jbisbee1/BARP> and is designed to work with modified functions based on that implementation. BARP extends post-stratification by computing variable contributions within each stratum defined by the stratifying variables. The resulting Shapley values are visualized through both global and local explanation methods. |
| Authors: | Dong-eun Lee [aut, cre], Eun-Kyung Lee [aut] |
| Maintainer: | Dong-eun Lee <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.0.11 |
| Built: | 2026-05-26 06:10:49 UTC |
| Source: | https://github.com/ldongeunl/bartxviz |
This function uses Bayesian Additive Regression Trees (BART) to extrapolate survey data to a level of geographic aggregation for which the original survey was not designed to be representative.
This is a modified version of the barp function from the BARP package (https://github.com/jbisbee1/BARP) that allows the random seed to be fixed.

    barps(y, x, dat, census, geo.unit, algorithm = "BARP", setSeed = NULL,
          proportion = "None", cred_int = c(0.025, 0.975), BSSD = FALSE,
          nsims = 200, ...)

y |
Outcome of interest. Should be a character of the column name containing the variable of interest. |
x |
Prognostic covariates. Should be a vector of column names corresponding to the covariates used to predict the outcome variable of interest. |
dat |
Survey data containing the x and y column names. The explanatory variables X included in the model must be converted to factors prior to input. |
census |
Census data containing the x column names. It must also have the same structure as X. If the user provides raw census data, BARP will calculate proportions for each unique bin of x covariates. Otherwise, the researcher must calculate bin proportions and indicate the column name that contains the proportions, either as percentages or as raw counts. |
geo.unit |
The column name corresponding to the unit at which outcomes should be aggregated. |
algorithm |
Algorithm for predicting opinions. Can be any algorithm(s) included in the SuperLearner package. If multiple algorithms are listed, predicted opinions are provided for each separately, as well as for the weighted ensemble. Defaults to "BARP". |
setSeed |
Seed to control random number generation. |
proportion |
The column name corresponding to the proportions for covariate bins in the Census data. If left to the default |
cred_int |
A vector giving the lower and upper bounds on the credible interval for the predictions. |
BSSD |
Calculate bootstrapped standard deviation. Defaults to FALSE. |
nsims |
The number of bootstrap simulations. |
... |
Additional arguments to be passed to bartMachine or SuperLearner. |
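
The census argument accepts either raw rows or precomputed bin proportions. As a rough illustration of the second option, the base R sketch below counts the rows in each covariate bin and converts the counts to proportions. The data frame census_raw and its columns are hypothetical, and taking proportions within each geographic unit is an assumption about the intended weighting rather than a description of what barps does internally.

```r
# Hypothetical raw census extract: one row per person (all names are illustrative).
census_raw <- data.frame(
  state = factor(c("A", "A", "A", "B", "B", "B", "B")),
  sex   = factor(c("F", "M", "F", "F", "M", "M", "M")),
  educ  = factor(c("HS", "HS", "HS", "Coll", "HS", "HS", "Coll"))
)

# Count the rows in every observed (state, sex, educ) bin.
bins <- aggregate(n ~ state + sex + educ,
                  data = cbind(census_raw, n = 1L), FUN = sum)

# Convert counts to within-state proportions (assumed weighting scheme).
bins$prop <- bins$n / ave(bins$n, bins$state, FUN = sum)

bins[order(bins$state), ]
```

A table like bins could then be passed as census, with the proportion argument naming the column that holds the weights ("prop" or "n" here).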
Returns an object of class BARP, containing a list of the following components:
pred.opn |
A |
trees |
A |
risk |
A |
barp.dat |
Data containing the estimates and credible intervals for each observation in the input census dataset. |
setSeed |
The random seed value employed during model estimation using bartMachine. |
proportion |
The number of observations in each combination of features. |
x |
The names of the explanatory variables included in the model. |
https://github.com/jbisbee1/BARP
barps is used to implement Bayesian Additive Regression Trees based on the bartMachine package.
For detailed options, see https://CRAN.R-project.org/package=bartMachine.
barps also uses the SuperLearner package to implement alternative regularizers.
For more details, see https://CRAN.R-project.org/package=SuperLearner.
A dataset used for modeling support for abortion coverage in insurance plans in the United States, combining individual- and district-level covariates from a 2018 survey. Out of all U.S. states, only data from 30 states are included in this dataset.
A data frame with 49,095 rows and 6 variables:
Allow employers to decline coverage of abortions in insurance plans (0: Oppose, 1: Support).
A factor variable representing the 30 selected U.S. states. The raw data includes states with state_fips values ranging from 1 to 30.
An ethnicity variable stored as a factor (White, Black, Hispanic, Other).
A factor variable representing gender (Female, Male).
A factor variable dividing respondents into age groups (18–29, 30–39, 40–49, 50–59, 60–69, 70+).
A factor variable representing educational attainment (No HS, HS, Some college, 4-Year College, Post-grad).
Juan Lopez-Martin, Justin H. Phillips, and Andrew Gelman, Multilevel Regression and Poststratification Case Studies https://bookdown.org/jl5522/MRP-case-studies/introduction-to-mrp.html#data
This dataset provides population counts in covariate bins based on the 2006 U.S. Census. Each row represents a unique combination of demographic covariates within a state.
A data frame with 2,940 rows and 9 variables, with the following components:
Numeric identifier for the state
Region code
Age group (1 = 18-30, 2 = 31-50, 3 = 51-65, 4 = 65+)
Gender and race interaction
Education level (1 = LTHS, 2 = HS, 3 = Some Coll, 4 = Coll+)
Republican presidential vote share in the previous election
Proportion of population identifying as religious conservatives
State-level ideology score (liberal to conservative)
Population count for the given covariate bin within the state
Bisbee, James. "BARP: Improving Mister P Using Bayesian Additive Regression Trees." American Political Science Review 113.4 (2019): 1060-1065. <https://github.com/jbisbee1/BARP>
The decision_plot function draws a graph that shows, using Shapley values, how individual features
contribute to a model's prediction for a specific observation.
It can display one observation or several observations at once.

    decision_plot(object, obs_num = NULL, title = NULL, geo.unit = NULL,
                  geo.id = NULL, bar_default = TRUE)

object |
The object returned by the Explain function, containing the model's contributions and results. |
obs_num |
A single observation number or a vector of observation numbers. |
title |
Plot title. |
geo.unit |
The name of the stratum variable in the BARP model as a character. |
geo.id |
Enter a single value of the stratum variable as a character. |
bar_default |
|
plot_out |
The decision plot for one or multiple observations specified in |

    ## Not run:
    ## Friedman data
    set.seed(2025)
    n <- 200
    p <- 5
    X <- data.frame(matrix(runif(n * p), ncol = p))
    y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
      10 * X[, 4] + 5 * X[, 5] + rnorm(n)

    ## BART model
    model <- dbarts::bart(X, y, keeptrees = TRUE, ndpost = 200)

    # Prediction wrapper function
    pfun <- function(object, newdata) {
      predict(object, newdata)
    }

    # Calculate Shapley values
    model_exp <- Explain(model, X = X, pred_wrapper = pfun)

    # Single observation
    decision_plot(model_exp, obs_num = 1)

    # Multiple observations
    decision_plot(model_exp, obs_num = 10:40)
    ## End(Not run)

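
To make the idea behind a decision plot concrete, here is a minimal base R sketch of the quantity such a plot traces for one observation: a path that starts at the baseline and adds one Shapley value at a time until it reaches the prediction. The numbers and the ordering rule (least to most influential) are illustrative assumptions, not output from decision_plot.

```r
# Hypothetical Shapley values for one observation (illustrative numbers only).
fnull <- 14.2
phi <- c(X1 = 1.8, X2 = -0.6, X3 = 0.3, X4 = 2.9, X5 = -1.1)

# Order features from least to most influential (an assumed convention).
ord <- order(abs(phi))

# Cumulative path: baseline first, then one feature contribution at a time.
path <- fnull + c(0, cumsum(phi[ord]))
names(path) <- c("baseline", names(phi)[ord])

# The last point on the path equals the model prediction for this observation.
fx <- fnull + sum(phi)
path
```

Plotting path against the feature order gives the single-observation line; repeating this for several observations gives the multi-observation version.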
Compute fast (approximate) Shapley values for a set of features using the Monte Carlo algorithm described in Štrumbelj and Kononenko (2014). An efficient algorithm for tree-based models, commonly referred to as Tree SHAP, is also supported for lightgbm (https://cran.r-project.org/package=lightgbm) and xgboost (https://cran.r-project.org/package=xgboost) models; see Lundberg et al. (2020) for details.

    Explain(object, ...)

    ## Default S3 method:
    Explain(object, feature_names = NULL, X = NULL, nsim = 1,
            pred_wrapper = NULL, newdata = NULL, parallel = FALSE, ...)

    ## S3 method for class 'lm'
    Explain(object, feature_names = NULL, X, nsim = 1, pred_wrapper,
            newdata = NULL, exact = FALSE, parallel = FALSE, ...)

    ## S3 method for class 'xgb.Booster'
    Explain(object, feature_names = NULL, X = NULL, nsim = 1, pred_wrapper,
            newdata = NULL, exact = FALSE, parallel = FALSE, ...)

    ## S3 method for class 'lgb.Booster'
    Explain(object, feature_names = NULL, X = NULL, nsim = 1, pred_wrapper,
            newdata = NULL, exact = FALSE, parallel = FALSE, ...)

object |
A fitted model object (e.g., a
|
... |
Additional arguments to be passed |
feature_names |
Character string giving the names of the predictor
variables (i.e., features) of interest. If |
X |
A matrix-like R object (e.g., a data frame or matrix) containing
ONLY the feature columns from the training data (or suitable background data
set). If the input includes categorical variables that need to be one-hot encoded,
please input data that has been processed using |
nsim |
The number of Monte Carlo repetitions to use for estimating each
Shapley value (only used when |
pred_wrapper |
Prediction function that requires two arguments,
|
newdata |
A matrix-like R object (e.g., a data frame or matrix)
containing ONLY the feature columns for the observation(s) of interest; that
is, the observation(s) you want to compute explanations for. Default is
|
parallel |
Logical indicating whether or not to compute the approximate
Shapley values in parallel across features; default is |
exact |
Logical indicating whether to compute exact Shapley values.
Currently only available for |
An object of class Explain with the following components:
newdata |
The dataset, as a data frame, used to estimate the Shapley values. Categorical variables are one-hot encoded. |
phis |
A list containing the Shapley values for each variable. |
fnull |
The expected value of the model's predictions. |
fx |
The prediction value for each observation. |
factor_names |
The name of the categorical variable.
If the data contains only continuous or dummy variables, it is set to |
Setting exact = TRUE with a linear model (i.e., a
stats::lm() or stats::glm() object) assumes that the
input features are independent.
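
For a linear model with independent features, the Shapley value of feature j for observation i has the closed form beta_j * (x_ij - mean(x_j)). The base R sketch below computes this directly for an lm fit on mtcars and checks that the baseline plus the contributions reproduces each prediction. It illustrates the formula only and is not claimed to mirror the package's internal code path for exact = TRUE.

```r
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
X <- mtcars[, c("wt", "hp", "qsec")]
beta <- coef(fit)[colnames(X)]

# phi[i, j] = beta_j * (x_ij - mean(x_j)); valid under feature independence.
centered <- scale(as.matrix(X), center = TRUE, scale = FALSE)
phi <- sweep(centered, 2, beta, `*`)

# Baseline (average prediction) and per-observation predictions.
fx <- predict(fit, newdata = X)
fnull <- mean(fx)

# Efficiency check: baseline plus contributions should match each prediction.
max(abs(fnull + rowSums(phi) - fx))
```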
Štrumbelj, E., and Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3), 647-665.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1), 56–67.

    ## Not run:
    #
    # A projection pursuit regression (PPR) example
    #

    # Load the sample data; see datasets::mtcars for details
    data(mtcars)

    # Fit a projection pursuit regression model
    fit <- ppr(mpg ~ ., data = mtcars, nterms = 5)

    # Prediction wrapper
    pfun <- function(object, newdata) {
      # needs to return a numeric vector
      predict(object, newdata = newdata)
    }

    # Compute approximate Shapley values using 10 Monte Carlo simulations
    set.seed(101)  # for reproducibility
    shap <- Explain(fit, X = subset(mtcars, select = -mpg), nsim = 10,
                    pred_wrapper = pfun)
    ## End(Not run)

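
The Monte Carlo idea behind the approximate method can be sketched in a few lines of base R. For each feature, the sketch draws a random background row and a random feature ordering, then measures how much the prediction changes when the feature of interest switches from the background value to the observed value. The function mc_shapley and its arguments are hypothetical names introduced for illustration; this is a simplified reading of the permutation algorithm, not the implementation used by Explain.

```r
# Hypothetical helper: Monte Carlo Shapley estimate for ONE observation.
# f: prediction function taking a named numeric vector
# x: named numeric vector to explain; X: numeric background matrix
mc_shapley <- function(f, x, X, nsim = 100) {
  p <- ncol(X)
  phi <- setNames(numeric(p), colnames(X))
  for (j in seq_len(p)) {
    diffs <- numeric(nsim)
    for (m in seq_len(nsim)) {
      z <- X[sample.int(nrow(X), 1), ]               # random background row
      perm <- sample.int(p)                          # random feature ordering
      before <- perm[seq_len(which(perm == j) - 1)]  # features preceding j
      with_j <- z
      with_j[c(before, j)] <- x[c(before, j)]        # j takes the observed value
      without_j <- z
      without_j[before] <- x[before]                 # j keeps the background value
      diffs[m] <- f(with_j) - f(without_j)
    }
    phi[j] <- mean(diffs)
  }
  phi
}

fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
X <- as.matrix(mtcars[, c("wt", "hp", "qsec")])
f <- function(v) unname(predict(fit, newdata = as.data.frame(t(v))))

set.seed(101)  # for reproducibility
phi <- mc_shapley(f, x = X[1, ], X = X, nsim = 200)
phi
```

Increasing nsim reduces the Monte Carlo noise at the cost of two prediction calls per repetition and feature.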
Computes global numerical summaries of Shapley values using two averaging criteria: observation-based and posterior-sample-based.

    Explain_stats(x, probs = 0.95, abs = TRUE, na.rm = TRUE,
                  geo.unit = NULL, geo.id = NULL)

    ## Default S3 method:
    Explain_stats(x, probs = 0.95, abs = TRUE, na.rm = TRUE,
                  geo.unit = NULL, geo.id = NULL)

    ## S3 method for class 'Explainbarp'
    Explain_stats(x, probs = 0.95, abs = TRUE, na.rm = TRUE,
                  geo.unit = NULL, geo.id = NULL)

x |
An object belonging to the |
probs |
Enter the probability for the quantile interval. Default is |
abs |
Logical. If |
na.rm |
Logical. Remove NA values in summaries. Default is |
geo.unit |
(Explainbarp only) Name of the stratification variable used in post-stratification. |
geo.id |
(Explainbarp only) One value of interest among the values of the stratification variable. |
A named list with two elements:
obs |
A data.frame containing observation-based numerical summaries of Shapley values for each variable. |
post |
A data.frame containing posterior-sample-based numerical summaries of Shapley values for each variable. |

    ## Not run:
    ## Friedman data
    set.seed(2025)
    n <- 200
    p <- 5
    X <- data.frame(matrix(runif(n * p), ncol = p))
    y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
      10 * X[, 4] + 5 * X[, 5] + rnorm(n)

    ## Using the dbarts library
    model <- dbarts::bart(X, y, keeptrees = TRUE, ndpost = 200)

    ## Prediction wrapper function
    pfun <- function(object, newdata) {
      predict(object, newdata)
    }

    ## Calculate Shapley values
    model_exp <- Explain(model, X = X, pred_wrapper = pfun)

    # Numerical summaries of the Shapley values
    Explain_stats(model_exp, probs = 0.95)
    ## End(Not run)

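
One way to picture the two averaging criteria is sketched below in base R, using simulated draws for a single variable. The layout (rows as posterior samples, columns as observations), the helper interval_summary, and the reading of "observation-based" as averaging over observations are all assumptions made for illustration; the structure of the package's phis list may differ.

```r
set.seed(1)
n_post <- 200
n_obs <- 50

# Hypothetical Shapley draws for ONE variable:
# rows = posterior samples, columns = observations (assumed layout).
phi_j <- matrix(rnorm(n_post * n_obs, mean = 0.5), nrow = n_post, ncol = n_obs)

# Mean plus a central quantile interval, e.g. 2.5% and 97.5% for probs = 0.95.
interval_summary <- function(v, probs = 0.95) {
  tail_prob <- (1 - probs) / 2
  q <- quantile(v, probs = c(tail_prob, 1 - tail_prob), names = FALSE)
  c(mean = mean(v), lower = q[1], upper = q[2])
}

# Averaging |phi| over observations leaves one value per posterior sample.
obs_based <- rowMeans(abs(phi_j))

# Averaging |phi| over posterior samples leaves one value per observation.
post_based <- colMeans(abs(phi_j))

rbind(obs = interval_summary(obs_based), post = interval_summary(post_based))
```

Both rows share the same mean, since each is an average of the same matrix, but the intervals describe different kinds of spread: variation across posterior samples versus variation across observations.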
This function calculates the contribution of each variable in the BARP (Bayesian Additive Regression Trees with Post-stratification) model using the permutation method.

    ## S3 method for class 'barp'
    Explain(object, feature_names = NULL, X = NULL, nsim = 1,
            pred_wrapper = NULL, census = NULL, geo.unit = NULL,
            parallel = FALSE, ...)

object |
A BARP model (Bayesian Additive Regression Tree) estimated
using the |
feature_names |
The name of the variable for which you want to check the contribution.
The default value is set to |
X |
The dataset containing all independent variables used as input when estimating the BART model. The explanatory variables |
nsim |
The number of Monte Carlo sampling iterations, which is fixed at |
pred_wrapper |
A function used to estimate the predicted values of the model. |
census |
Census data containing the names of the |
geo.unit |
Enter the name of the stratification variable used in post stratification. |
parallel |
The default value is set to |
... |
Additional arguments to be passed |
Returns an object of class Explainbarp, a list with the following components:
phis |
A list containing the Shapley values for each variable. |
newdata |
The data used to check the contribution of variables. If a variable has two categories, it is dummy-coded, and if it has three or more categories, categorical variables are one-hot encoded. |
fnull |
The expected value of the model's predictions. |
fx |
The prediction value for each observation. |
factor_names |
The name of the categorical variable. If the data contains only continuous or dummy variables, it is set to |
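
Post-stratification itself is a weighted average: cell-level quantities are combined within each stratum using the census bin proportions as weights. The base R sketch below applies that weighting to both predictions and the Shapley values of one feature. The cells table, its column names, and the helper poststratify are hypothetical, and the sketch is not a description of how Explain aggregates contributions internally.

```r
# Hypothetical cell-level table: one row per covariate bin within a state.
cells <- data.frame(
  state    = c("A", "A", "A", "B", "B"),
  prop     = c(0.5, 0.3, 0.2, 0.6, 0.4),       # within-state bin proportions
  pred     = c(0.40, 0.55, 0.70, 0.35, 0.60),  # model prediction per bin
  phi_educ = c(-0.05, 0.02, 0.10, -0.03, 0.04) # Shapley value of one feature
)

# Weighted average of a cell-level quantity within each stratum.
poststratify <- function(value, weight, group) {
  tapply(value * weight, group, sum) / tapply(weight, group, sum)
}

state_pred <- poststratify(cells$pred, cells$prop, cells$state)
state_phi <- poststratify(cells$phi_educ, cells$prop, cells$state)
state_pred
state_phi
```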
bart
The Explain.bart function calculates the contribution of each variable
in a Bayesian Additive Regression Trees (BART) model using permutation.
It computes Shapley values for models estimated with the bart function from the dbarts package.

    ## S3 method for class 'bart'
    Explain(object, feature_names = NULL, X = NULL, nsim = 1,
            pred_wrapper = NULL, newdata = NULL, parallel = FALSE, ...)

object |
A BART model (Bayesian Additive Regression Tree) estimated
using the |
feature_names |
The name of the variable for which you want to check the contribution.
The default value is set to |
X |
The dataset containing all independent variables used as input when estimating the BART model. |
nsim |
The number of Monte Carlo sampling iterations, which is fixed at |
pred_wrapper |
A function used to estimate the predicted values of the model. |
newdata |
New data containing the variables included in the model.
This is used when checking the contribution of newly input data using the model.
The default value is set to |
parallel |
The default value is set to |
... |
Additional arguments to be passed |
Returns an object of class ExplainBART, a list with the following components:
phis |
A list containing the Shapley values for each variable. |
newdata |
The data used to check the contribution of variables. If a variable has categories, categorical variables are one-hot encoded. |
fnull |
The expected value of the model's predictions. |
fx |
The prediction value for each observation. |
factor_names |
The name of the categorical variable. If the data contains only continuous or dummy variables, it is set to |

    ## Not run:
    ## Friedman data
    set.seed(2025)
    n <- 200
    p <- 5
    X <- data.frame(matrix(runif(n * p), ncol = p))
    y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
      10 * X[, 4] + 5 * X[, 5] + rnorm(n)

    ## Using the dbarts library
    model <- dbarts::bart(X, y, keeptrees = TRUE, ndpost = 200)

    ## Prediction wrapper function
    pfun <- function(object, newdata) {
      predict(object, newdata)
    }

    ## Calculate Shapley values
    model_exp <- Explain(model, X = X, pred_wrapper = pfun)
    ## End(Not run)

bartMachine
This function calculates the contribution of each variable
in a Bayesian Additive Regression Trees (BART) model using permutation.
It computes Shapley values for models estimated
with the bartMachine function from the bartMachine package.

    ## S3 method for class 'bartMachine'
    Explain(object, feature_names = NULL, X = NULL, nsim = 1,
            pred_wrapper = NULL, newdata = NULL, parallel = FALSE, ...)

object |
A BART model (Bayesian Additive Regression Tree) estimated
using the |
feature_names |
The name of the variable for which you want to check the contribution.
The default value is set to |
X |
The dataset containing all independent variables used as input when estimating the BART model. Categorical or character variables must not contain an underscore ("_") in their values or labels. |
nsim |
The number of Monte Carlo repetitions used for estimating each Shapley value is set to |
pred_wrapper |
A function used to estimate the predicted values of the model. |
newdata |
New data containing the variables included in the model.
This is used when checking the contribution of newly input data using the model.
The default value is set to |
parallel |
The default value is set to |
... |
Additional arguments to be passed |
An object of class ExplainbartMachine with the following components:
phis |
A list containing the Shapley values for each variable. |
newdata |
The data used to check the contribution of variables. If a variable has categories, categorical variables are one-hot encoded. |
fnull |
The expected value of the model's predictions. |
fx |
The prediction value for each observation. |
factor_names |
The name of the categorical variable. If the data contains only continuous or dummy variables, it is set to |

    ## Not run:
    ## Friedman data
    set.seed(2025)
    n <- 200
    p <- 5
    X <- data.frame(matrix(runif(n * p), ncol = p))
    y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
      10 * X[, 4] + 5 * X[, 5] + rnorm(n)

    ## Using bartMachine
    model <- bartMachine::bartMachine(X, y, seed = 2025,
                                      num_iterations_after_burn_in = 200)

    ## Prediction wrapper function
    pfun <- function(object, newdata) {
      bartMachine::bart_machine_get_posterior(object, newdata)$y_hat_posterior_samples
    }

    ## Calculate Shapley values
    model_exp <- Explain(model, X = X, pred_wrapper = pfun)
    ## End(Not run)

wbart or gbart
The Explain.wbart function calculates the contribution of each variable
in a Bayesian Additive Regression Trees (BART) model using permutation.
It computes Shapley values for models estimated with the wbart or gbart functions from the BART package.

    ## S3 method for class 'wbart'
    Explain(object, feature_names = NULL, X = NULL, nsim = 1,
            pred_wrapper = NULL, newdata = NULL, parallel = FALSE, ...)

object |
A BART model (Bayesian Additive Regression Tree) estimated
using the |
feature_names |
The name of the variable for which you want to check the contribution.
The default value is set to |
X |
The dataset containing all independent variables used as input when estimating the BART model. |
nsim |
The number of Monte Carlo repetitions used for estimating each Shapley value is set to |
pred_wrapper |
A function used to estimate the predicted values of the model. |
newdata |
New data containing the variables included in the model.
This is used when checking the contribution of newly input data using the model.
The default value is set to |
parallel |
The default value is set to |
... |
Additional arguments to be passed |
Returns an object of class ExplainBART, a list with the following components:
phis |
A list containing the Shapley values for each variable. |
newdata |
The data used to check the contribution of variables. If a variable has categories, categorical variables are one-hot encoded. |
fnull |
The expected value of the model's predictions. |
fx |
The prediction value for each observation. |
factor_names |
The name of the categorical variable.
If the data contains only continuous or dummy variables, it is set to |

    ## Not run:
    ## Friedman data
    set.seed(2025)
    n <- 200
    p <- 5
    X <- data.frame(matrix(runif(n * p), ncol = p))
    y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
      10 * X[, 4] + 5 * X[, 5] + rnorm(n)

    ## Using the BART package
    model <- BART::wbart(X, y, ndpost = 200)

    ## Prediction wrapper function
    pfun <- function(object, newdata) {
      predict(object, newdata)
    }

    ## Calculate Shapley values
    model_exp <- Explain(model, X = X, pred_wrapper = pfun)
    ## End(Not run)

One-Hot-Encode unordered factor columns of a data.table

    one_hot(dt, cols = "auto", sparsifyNAs = FALSE, naCols = FALSE,
            dropCols = TRUE, dropUnusedLevels = FALSE)

dt |
A data.table |
cols |
Which column(s) should be one-hot-encoded? DEFAULT = "auto" encodes all unordered factor columns |
sparsifyNAs |
Should NAs be converted to 0s? |
naCols |
Should columns be generated to indicate the presence of NAs? Will only apply to factor columns with at least one NA |
dropCols |
Should the resulting data.table exclude the original columns which are one-hot-encoded? |
dropUnusedLevels |
Should columns of all 0s be generated for unused factor levels? |
One-hot-encoding converts an unordered categorical vector (i.e. a factor) to multiple binarized vectors where each binary vector of 1s and 0s indicates the presence of a class (i.e. level) of the original vector.
A data.table object: the input data with its categorical variables one-hot encoded.
https://cran.r-project.org/web/packages/mltools

    library(data.table)

    dt <- data.table(
      ID = 1:4,
      color = factor(c("red", NA, "blue", "blue"),
                     levels = c("blue", "green", "red"))
    )

    one_hot(dt)
    one_hot(dt, sparsifyNAs = TRUE)
    one_hot(dt, naCols = TRUE)
    one_hot(dt, dropCols = FALSE)
    one_hot(dt, dropUnusedLevels = TRUE)

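
For readers who want to see what the encoding amounts to without data.table, here is a minimal base R sketch that builds one indicator column per factor level. It is an illustration of the concept, not the one_hot implementation, and the "color_" prefix is an illustrative naming choice.

```r
color <- factor(c("red", NA, "blue", "blue"),
                levels = c("blue", "green", "red"))

# One indicator column per factor level; NA rows stay NA in every column.
encoded <- sapply(levels(color), function(lv) as.integer(color == lv))
colnames(encoded) <- paste0("color_", levels(color))
encoded
```

The unused level "green" yields a column with no 1s, and the missing value propagates as NA across its row, which is the situation the sparsifyNAs and dropUnusedLevels options address.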
The plot.Explain function provides various visualization methods for Shapley values.
The values and format used in the graph are determined based on the input parameters.

    ## S3 method for class 'Explain'
    plot(x, average = NULL, type = NULL, num_post = NULL, plot.flag = TRUE,
         adjust = FALSE, probs = 0.95, title = NULL, xlab = NULL,
         ylab = NULL, ...)

x |
An |
average |
Input the reference value for calculating the mean of the object's |
type |
|
num_post |
To check the contribution of variables for a single posterior sample, enter a value within the number of posterior samples. |
plot.flag |
If |
adjust |
The default value is |
probs |
Enter the probability for the quantile interval. The default value is |
title |
The title of the plot, with a default value of |
xlab |
Enter the label to be displayed on the x-axis. If not provided, a default label will be used. |
ylab |
Enter the label for the y-axis if needed. |
... |
Additional arguments to be passed |
A plot is returned according to the specified options:
out |
If average is |
This function visualizes the computed Shapley values in
various ways for objects of the Explainbarp class. The type of plot
generated depends on the input parameters.
Because the BARP model is visualized one stratum at a time,
the user must specify both the stratum variable and the value of the stratum to be visualized.

    ## S3 method for class 'Explainbarp'
    plot(x, average = NULL, type = NULL, num_post = NULL, plot.flag = TRUE,
         adjust = FALSE, probs = 0.95, title = NULL, geo.unit = NULL,
         geo.id = NULL, xlab = NULL, ylab = NULL, ...)

x |
An |
average |
Input the reference value for calculating the mean of the object's phi list.
|
type |
|
num_post |
To check the contribution of variables for a single posterior sample, enter a value within the number of posterior samples. |
plot.flag |
If |
adjust |
The default value is |
probs |
Enter the probability for the quantile interval. The default value is |
title |
The title of the plot, with a default value of |
geo.unit |
Enter the name of the stratification variable used in post stratification. |
geo.id |
Enter one value of interest among the values of the stratification variable. |
xlab |
Enter the label to be displayed on the x-axis. If not provided, a default label will be used. |
ylab |
Enter the label for the y-axis if needed. |
... |
Additional arguments to be passed |
A plot is returned according to the specified options:
out |
If average is |
The plot.ExplainBART function provides various visualization methods for Shapley values.
It is designed to visualize ExplainBART class objects, which contain Shapley values computed from models estimated using the bart function from the dbarts package or the wbart/gbart functions from the BART package.
The values and format used in the graph are determined based on the input parameters.

    ## S3 method for class 'ExplainBART'
    plot(x, average = NULL, type = NULL, num_post = NULL, plot.flag = TRUE,
         adjust = FALSE, probs = 0.95, title = NULL, xlab = NULL,
         ylab = NULL, ...)

x |
An |
average |
Input the reference value for calculating the mean of the object's |
type |
|
num_post |
To check the contribution of variables for a single posterior sample, enter a value within the number of posterior samples. |
plot.flag |
If |
adjust |
The default value is |
probs |
Enter the probability for the quantile interval. The default value is |
title |
The title of the plot, with a default value of |
xlab |
Enter the label to be displayed on the x-axis. If not provided, a default label will be used. |
ylab |
Enter the label for the y-axis if needed. |
... |
Additional arguments to be passed |
The plot is returned based on the specified option.
out |
If average is |
    ## Not run:
    ## Friedman data
    set.seed(2025)
    n <- 200
    p <- 5
    X <- data.frame(matrix(runif(n * p), ncol = p))
    y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
      10 * X[, 4] + 5 * X[, 5] + rnorm(n)
    ## Using dbarts
    model <- dbarts::bart(X, y, keeptrees = TRUE, ndpost = 200)
    # Prediction wrapper function
    pfun <- function(object, newdata) {
      predict(object, newdata)
    }
    # Calculate Shapley values
    model_exp <- Explain(model, X = X, pred_wrapper = pfun)
    # Distribution of Shapley values (boxplot),
    # computed based on observation and posterior sample criteria
    plot(model_exp, average = "both")
    # Barplot based on observation criteria
    plot(model_exp, average = "obs", type = "bar", probs = 0.95)
    # Barplot based on posterior sample
    plot(model_exp, average = "post", type = "bar")
    # Summary plot based on posterior sample
    plot(model_exp, average = "post", type = "bees")
    # Summary plot of the 100th posterior sample
    plot(model_exp, average = "post", type = "bees", num_post = 100)
    # Barplot of the adjusted baseline
    plot(model_exp, type = "bar", adjust = TRUE)
    ## End(Not run)
The plot.ExplainbartMachine function provides various visualization methods for Shapley values.
It is designed to visualize ExplainbartMachine class objects, which contain Shapley values computed from models estimated using the bartMachine function from the bartMachine package.
The values and format used in the graph are determined based on the input parameters.
    ## S3 method for class 'ExplainbartMachine'
    plot(
      x,
      average = NULL,
      type = NULL,
      num_post = NULL,
      plot.flag = TRUE,
      adjust = FALSE,
      probs = 0.95,
      title = NULL,
      xlab = NULL,
      ylab = NULL,
      ...
    )
x |
An |
average |
Input the reference value for calculating the mean of the object's |
type |
|
num_post |
To check the contribution of variables for a single posterior sample, enter a value within the number of posterior samples. |
plot.flag |
If |
adjust |
The default value is |
probs |
Enter the probability for the quantile interval. The default value is |
title |
The title of the plot, with a default value of |
xlab |
Enter the label to be displayed on the x-axis. If not provided, a default label will be used. |
ylab |
Enter the label for the y-axis if needed. |
... |
Additional arguments to be passed |
The plot is returned based on the specified option.
out |
If average is |
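The `probs` argument of the plot methods above sets the probability for a quantile interval. As a minimal sketch of what such an interval looks like, the following base R code computes an equal-tailed interval from simulated draws. The vector `draws` is hypothetical and stands in for posterior samples of one variable's contribution; the equal-tailed convention is an assumption here, not a statement about how the package computes its intervals.

```r
# Hypothetical posterior draws of one variable's contribution
# (simulated; not produced by the package).
set.seed(1)
draws <- rnorm(200, mean = 0.5, sd = 0.1)

# Probability mass of the interval, matching the documented default.
probs <- 0.95

# Assumption: an equal-tailed interval, leaving (1 - probs) / 2 in each tail.
alpha <- 1 - probs
interval <- quantile(draws, c(alpha / 2, 1 - alpha / 2))

print(interval)
```

With `probs = 0.95`, the lower and upper bounds are the 2.5% and 97.5% sample quantiles of the draws.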
A dataset used for poststratification modeling based on the 2014–2018 American Community Survey (ACS),
including 7,200 combinations of demographic and geographic strata (30 states × 4 ethnicity groups × 2 genders × 6 age groups × 5 education levels). The table includes more combinations than the number of observed units, so some strata are not represented in the CCES sample. This dataset includes only the 30 states that are present in cces_30_df.
A data frame with 7200 rows and 6 variables:
A factor variable representing the 30 selected U.S. states, the same states as those included in cces_30_df.
An ethnicity variable stored as a factor (White, Black, Hispanic, Other).
A factor variable representing gender (Female, Male).
A factor variable dividing respondents into age groups (18–29, 30–39, 40–49, 50–59, 60–69, 70+).
A factor variable representing educational attainment (No HS, HS, Some college, 4-Year College, Post-grad).
Number of individuals in each demographic-geographic combination
Juan Lopez-Martin, Justin H. Phillips, and Andrew Gelman, Multilevel Regression and Poststratification Case Studies https://bookdown.org/jl5522/MRP-case-studies/introduction-to-mrp.html#data
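The size of the poststratification table follows from crossing the factor levels listed above. The sketch below rebuilds a table of the same shape with `expand.grid` and then forms a population-weighted average per state. The column names (`state`, `eth`, `male`, `age`, `educ`, `n`, `pred`), the state labels, the counts, and the predictions are all hypothetical placeholders, not the contents of the packaged dataset.

```r
# Cross all strata: 30 states x 4 ethnicities x 2 genders x 6 ages x 5 education levels.
# Column names and state labels are hypothetical placeholders.
strata <- expand.grid(
  state = paste0("S", 1:30),
  eth   = c("White", "Black", "Hispanic", "Other"),
  male  = c("Female", "Male"),
  age   = c("18-29", "30-39", "40-49", "50-59", "60-69", "70+"),
  educ  = c("No HS", "HS", "Some college", "4-Year College", "Post-grad"),
  stringsAsFactors = TRUE
)
print(nrow(strata))  # 30 * 4 * 2 * 6 * 5 = 7200

# Hypothetical population counts and model predictions per stratum.
set.seed(1)
strata$n    <- rpois(nrow(strata), lambda = 50) + 1
strata$pred <- runif(nrow(strata))

# Poststratified estimate per state: predictions weighted by stratum size.
state_est <- sapply(
  split(strata, strata$state),
  function(d) weighted.mean(d$pred, d$n)
)
head(state_est)
```

Weighting each stratum's prediction by its population count is the standard poststratification step; the real table would supply the counts from the ACS rather than simulated values.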
A dataset used for modeling support for gay marriage in the United States, combining individual- and state-level covariates from a 2006 survey.
A data frame with 5000 rows and 11 variables:
Unique observation identifier
Two-letter abbreviation for U.S. state
Numeric identifier for the state
Region code
Age group (1 = 18-30, 2 = 31-50, 3 = 51-65, 4 = 65+)
Gender and race interaction
Education level (1 = LTHS, 2 = HS, 3 = Some Coll, 4 = Coll+)
Support for gay marriage (0 = oppose, 1 = support)
Republican presidential vote share in the previous election
Proportion of population identifying as religious conservatives
State-level ideology score (liberal to conservative)
Bisbee, James. "BARP: Improving Mister P Using Bayesian Additive Regression Trees." American Political Science Review 113.4 (2019): 1060-1065.
The waterfall_plot function draws a bar chart of the positive and
negative contributions of each variable for a single observation,
showing how the contributions accumulate step by step.
    waterfall_plot(
      object,
      obs_num = NULL,
      title = NULL,
      geo.unit = NULL,
      geo.id = NULL,
      obs_name = NULL
    )
object |
Enter the name of the object that contains the model's contributions and results obtained using the Explain function. |
obs_num |
The number of the observation to plot (a single value only). |
title |
The title of the plot. |
geo.unit |
The name of the stratum variable in the BARP model as a character. |
geo.id |
Enter a single value of the stratum variable as a character. |
obs_name |
Enter the name of the vector containing observation IDs or names. |
The function returns a waterfall plot.
plot_out |
The waterfall plot of the observation at index |
    ## Not run:
    ## Friedman data
    set.seed(2025)
    n <- 200
    p <- 5
    X <- data.frame(matrix(runif(n * p), ncol = p))
    y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
      10 * X[, 4] + 5 * X[, 5] + rnorm(n)
    ## Using dbarts
    model <- dbarts::bart(X, y, keeptrees = TRUE, ndpost = 200)
    # Prediction wrapper function
    pfun <- function(object, newdata) {
      predict(object, newdata)
    }
    # Calculate Shapley values
    model_exp <- Explain(model, X = X, pred_wrapper = pfun)
    # Waterfall plot of the 100th observation
    waterfall_plot(model_exp, obs_num = 100)
    ## End(Not run)
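To make the idea behind a waterfall plot concrete, the sketch below computes the bar positions by hand in base R. It relies only on the additive property of Shapley values (a baseline plus the per-variable contributions equals the prediction). The values in `phi` and `baseline` are hypothetical, and this is not the internal code of waterfall_plot.

```r
# Hypothetical Shapley values for one observation and a hypothetical baseline.
phi      <- c(x1 = 1.2, x2 = -0.7, x3 = 0.4)
baseline <- 10

# Each bar starts where the previous one ended.
ends   <- baseline + cumsum(phi)
starts <- c(baseline, head(ends, -1))

bars <- data.frame(
  variable  = names(phi),
  start     = unname(starts),
  end       = unname(ends),
  direction = ifelse(phi >= 0, "positive", "negative")
)
print(bars)

# The last bar ends at the prediction: baseline + sum(phi).
prediction <- baseline + sum(phi)
```

Reading the bars in order shows how the baseline is pushed up or down by each variable until it reaches the prediction for that observation.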