Appendix B — Supplementary Example

We now perform an example research project on the final analysis ready data which is stored at data/analysis.

B.1 Overview

In the following examples we use the Bacteremia data with complete observations regarding the key predictors PLT, BUN, NEU, AGE, CREA_T, WBC_noNEU_T, which represent 93.9% of the dataset w.r.t key predictors. We will fit a global logistic regression model with the outcome BACTEREMIAN (i.e. presence of bacteremia) and the key predictors as covariates. We will use pseudo-log transformations as suggested in the IDA. Within the model, all key predictors will be transformed by fractional polynomials of order 1 (df = 2).

The aim of the examples is to showcase how decisions derived from IDA influence the results of the fitted model.

B.1.1 Global Model

The global model will be fitted by the mfp function. If not indicated otherwise, we will use the fp-transformations of the key predictors determined in global model in all consecutive models. For all models we report McFaddens’s R² and the AUC, i.e. the area under the ROC curve, and boxplots comparing “BACTEREMIAN” predictions with outcomes.

To put the results of the global model into context, we will first review the histograms of the original and the transformed predictors:

B.1.2 Model Summary

The global model is specified as print(global_formula, quote = TRUE). The model is fit to the complete cases data on the transformed scale stored in the data set data_model_trans.

Report the global model fit.

global model - including transformed variables
term estimate std.error statistic p.value
(Intercept) −2.719 0.588 −4.63 3.74 × 10−6
I((WBC_noNEU_T + 0.1)^0.5) −5.322 0.260 −20.44 7.12 × 10−93
I(((NEU + 0.1)/10)^0.5) 1.521 0.117 13.05 6.44 × 10−39
I((AGE/100)^1) 1.684 0.199 8.46 2.57 × 10−17
I(CREA_T^1) 0.709 0.198 3.58 3.40 × 10−4
I(((PLT + 1)/100)^1) −0.082 0.032 −2.56 1.04 × 10−2
I((BUN/10)^1) 0.011 0.024 0.45 6.53 × 10−1

Plot predictions by outcome and report the model AUC.

Global model - including covariates on the transformed psuedo log scale

B.1.3 Functional forms of global model

We now take a look at the functional forms of the covariates in the global model, which are determined by the fp algorithm. Besides scaling factors, only for WBC_noNEU_T the fp algorithm chose a non-linear transformation (note the ‘^0.5’ in the term column). This means all other covariates enter the model in a linear fashion. In the following effect plots, each variable is adjusted to the median of the other variables in the model.

(a)

(b)

(c)

(d)

(e)

(f)

Figure B.1: Functional forms of global model - repeats over plots

B.2 Example 1: to transform or not to transform

Only for one out of the six key predictors did the fp algorithm chose a non-linear transformation. But out of those six variables, four were pseudo-log transformed before entering the model. In the first example we want to compare the global model to a model using the key predictors on their original scale.

Global model without pseudo-log tranformations
term estimate std.error statistic p.value
(Intercept) −3.18 0.28 −11.17 5.48 × 10−29
log((WBC_noNEU + 0.2)) −1.25 0.06 −20.22 5.92 × 10−91
I(((NEU + 0.1)/10)^0.5) 1.52 0.12 12.99 1.34 × 10−38
I((AGE/100)^1) 1.64 0.20 8.26 1.51 × 10−16
I(((PLT + 1)/100)^1) −0.08 0.03 −2.53 1.13 × 10−2
I((BUN/10)^1) 0.02 0.02 0.68 5.00 × 10−1
I(CREA^-0.5) −0.77 0.21 −3.69 2.20 × 10−4

Note the different fp-transformation arising when the key predictors are not pseudo-log transformed. On the original scale, three covariates instead of one now enter the model via a non-linear fp-transformation. This suggests that a transformation prior to the regression model ‘outsources’ the need for transformations within the model. Now let us compare the model performances.

Global model(s) with covariates on original and transformed scale.

With regards to McFadden’s R² and the AUC, the differences between the two approaches is marginal.

Next we will compare the differences of the functional forms in the two models for those covariates where a pseudo-log transformation was suggested in IDA. We will look at the log odds for bacteremia by each covariate on the original and the transformed scale, and compare the global model using the original and the pseudo-log transformed covariates. Each variable is adjusted for the median of all other variables used.

Comparing functional form of both global models.

B.3 Example 2: Interpretation of regression coefficient ‘size’

As a result of IDA, it turned out that the key predictor WBC_noNEU should be transformed for further analyses because of its extreme skewness. As a consequence, the original variable WBC_noNEU and the pseudo-logged variable WBC_noNEU_T are now on two very different scales:

To illustrate that the scale of the original variable determines the size of the regression coefficient, let us consider two logistic regression models with WBC_noNEU and t_WBC_noNEU as single covariate, respectively.

Comparing original and transformed scale
term estimate std.error statistic
WBC_noNEU −0.557 0.035 −15.731
WBC_noNEU_T −3.008 0.160 −18.839

Even if the model with transformed WBC has a slightly better prediction performance as revealed by the summary plots below, this is not the reason for the difference. The estimates -0.56 and -3.01 indicate the difference in log odds for the outcome when the ‘term’ variable differs by 1 unit, but cannot be compared directly. A \(1\) unit difference is only a small step on the original scale, where WBC_noNEU covers values from -0.15 up to 152.74. In comparison, WBC_noNEU_T lies between -0.06 up to 2.41, so a difference of \(1\) unit cover almost half the range of the variable.

B.4 Example 3: the support of a model determines what it can explain

Next we compare the global model to a model were for an important variable, in our case we chose age, the variable support is reduced to the central 50% of the data (i.e. data within the 25% and 75% quantiles), and for comparison we also generate a scenario where we randomly sample 50% of the data.

When evaluating the apparent predictive performance in terms of the AUC and the scaled Brier score, we see that the model fitted on the participants with age between the first and third quartiles performs worse than the global model and the model where 50% were randomly sampled.

data AUC scaled Brier score
complete 0.731 0.065
central 50% of age distribution 0.723 0.053
randomly sampled 50% 0.738 0.065

Plot the model predictions based on different missing data scenarios

Model predictions based on different missing data scenarios.

B.5 Example 4: the limits of mulitiple imputation

To show the effect of multiple imputation if the number of missing values is high, we construct a dataset with 50% artificially generated missing values in one variable. First, recall the output of the complete model, relying on the Bacteremia data with complete cases regarding the key predictors.

global model
term estimate std.error statistic p.value
(Intercept) −2.72 0.59 −4.63 3.74 × 10−6
I((WBC_noNEU_T + 0.1)^0.5) −5.32 0.26 −20.44 7.12 × 10−93
I(((NEU + 0.1)/10)^0.5) 1.52 0.12 13.05 6.44 × 10−39
I((AGE/100)^1) 1.68 0.20 8.46 2.57 × 10−17
I(CREA_T^1) 0.71 0.20 3.58 3.40 × 10−4
I(((PLT + 1)/100)^1) −0.08 0.03 −2.56 1.04 × 10−2
I((BUN/10)^1) 0.01 0.02 0.45 6.53 × 10−1

Creatinine (‘CREA’) is significant at a level that might not survive substantial missingness. We thus create a dataset were we artificially introduce 50% missing creatinine values, missing completely at random.

Next we fit a ‘complete case’ model in the case of missing creatinine data, using the fp-transformations from the global model.

Now we impute the missing creatinine data using MICE with 50 imputations, fit the model using the fp-transformations from the global model and pool the results.

We now can compare the outputs of the complete model, the complete model with missing data (i.e. only half of the original complete data is used), and the imputed model.

Comparison of the model with no missing data, with complete cases, and with imputed data
term model estimate std.error statistic p.value
(Intercept) missing, complete cases −2.652 0.842 −3.151 1.63 × 10−3
missing, imputed −2.854 0.706 −4.045 6.10 × 10−5
no missing data −2.719 0.588 −4.625 3.74 × 10−6
I(((NEU + 0.1)/10)^0.5) missing, complete cases 1.518 0.168 9.012 2.02 × 10−19
missing, imputed 1.524 0.117 13.064 8.98 × 10−39
no missing data 1.521 0.117 13.049 6.44 × 10−39
I(((PLT + 1)/100)^1) missing, complete cases −0.101 0.045 −2.226 2.60 × 10−2
missing, imputed −0.076 0.032 −2.395 1.67 × 10−2
no missing data −0.082 0.032 −2.563 1.04 × 10−2
I((AGE/100)^1) missing, complete cases 1.756 0.283 6.201 5.60 × 10−10
missing, imputed 1.648 0.199 8.267 1.49 × 10−16
no missing data 1.684 0.199 8.464 2.57 × 10−17
I((BUN/10)^1) missing, complete cases 0.002 0.033 0.046 9.64 × 10−1
missing, imputed 0.007 0.026 0.276 7.83 × 10−1
no missing data 0.011 0.024 0.449 6.53 × 10−1
I((WBC_noNEU_T + 0.1)^0.5) missing, complete cases −5.502 0.375 −14.654 1.28 × 10−48
missing, imputed −5.336 0.261 −20.466 1.01 × 10−91
no missing data −5.322 0.260 −20.442 7.12 × 10−93
I(CREA_T^1) missing, complete cases 0.745 0.284 2.621 8.77 × 10−3
missing, imputed 0.764 0.245 3.119 1.95 × 10−3
no missing data 0.709 0.198 3.582 3.40 × 10−4

When half of the data is missing, the standard errors of the coefficients increase by a factor of approximately \(\sqrt 2\). Multiple imputation is able to recreate standard errors very close to those of the model with no missing data in most variables, but not in the one that was being imputed, namely creatinine. Its relative importance in the model is higher in the complete cases model than in the imputed model.

B.6 Example 5: Plot of functional form should focus on areas with high density

The functional forms have wide confidence intervals when the data is sparse. In presentations of the effects, plots of the functional forms should focus on areas with high density. In this analysis, PLT was very sparse above ~800 [UNITS], which is reflected in a large confidence interval for high PLT values. In the effect plot PLT values could be limited to values <800 [UNITS].

Model predictions

WBC_noNEU comparison

PLT comparison