8  Univariate distribution checks

This section reports a series of univariate summary checks of the bacteremia dataset.

8.1 Data set overview

Using the Hmisc describe function, we provide an overview of the data set. The descriptive report also provides histograms of continuous variables. For ease of scanning the information, we group the report by measurement type.
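The grouped overview described above could be produced along the following lines. This is a minimal sketch on made-up data: the data frame `toy` and the list `var_groups` are hypothetical stand-ins, not objects from this report, and the call is skipped if Hmisc is not installed.

```r
# Toy stand-in for the bacteremia data; all values are made up for illustration.
toy <- data.frame(
  AGE = c(34, 58, 71, 45, 80),
  SEX = c(1, 2, 1, 1, 2),
  WBC = c(6.2, 11.4, NA, 9.8, 15.1)
)

# Hypothetical grouping of variable names by measurement type.
var_groups <- list(
  "Demographic variables" = c("AGE", "SEX"),
  "Key predictors" = c("WBC")
)

# One describe() report per group, titled with the group name.
reports <- list()
if (requireNamespace("Hmisc", quietly = TRUE)) {
  for (group in names(var_groups)) {
    group_data <- toy[, var_groups[[group]], drop = FALSE]
    reports[[group]] <- Hmisc::describe(group_data, descript = group)
    print(reports[[group]])
  }
}
```

Keeping the groups in a named list makes it easy to reuse the same grouping for later plots and summaries.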

8.1.1 Demographic variables

Demographic variables

2 Variables   14691 Observations

AGE: Patient Age years
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14691        0       85        1    56.17    20.78       24       29       43       58       70       79       84
lowest : 16 17 18 19 20 , highest: 96 97 98 99 101

SEX: Patient sex 1=male, 2=female
        n  missing distinct     Info     Mean      Gmd
    14691        0        2     0.73    1.419   0.4869
 Value          1     2
 Frequency   8536  6155
 Proportion 0.581 0.419

8.1.2 Structural covariates and key predictors

Structural covariates and key predictors

7 Variables   14691 Observations

WBC: White blood count G/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14229      462     2710        1    11.23    7.602     2.66     4.26     6.63     9.60    13.53    18.22    22.27
lowest : 0.00 0.01 0.02 0.03 0.04 , highest: 365.30 383.74 387.73 433.83 604.47

AGE: Patient Age years
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14691        0       85        1    56.17    20.78       24       29       43       58       70       79       84
lowest : 16 17 18 19 20 , highest: 96 97 98 99 101

SEX: Patient sex 1=male, 2=female
        n  missing distinct     Info     Mean      Gmd
    14691        0        2     0.73    1.419   0.4869
 Value          1     2
 Frequency   8536  6155
 Proportion 0.581 0.419

BUN: Blood urea nitrogen mg/dl
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14519      172      947        1    22.66    16.92      7.1      8.6     11.6     16.6     26.9     44.8     60.8
lowest : 2.5 2.7 2.8 2.9 3.0 , highest: 160.6 171.3 171.9 176.0 184.8

CREA: Creatinine mg/dl
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14532      159      674        1    1.329   0.8518    0.620    0.690    0.810    1.000    1.350    2.160    3.144
lowest : 0.26 0.27 0.28 0.29 0.30 , highest: 15.24 15.40 15.67 16.64 20.75

NEU: Neutrophiles G/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    13963      728      374        1    8.367    5.776     1.60     2.70     4.60     7.30    10.80    15.08    18.40
lowest : 0.0 0.1 0.2 0.3 0.4 , highest: 54.0 56.4 63.7 71.6 83.8

PLT: Blood platelets G/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14649       42      718        1      220    130.1       50       81      140      204      277      369      445
lowest : 0 1 2 3 4 , highest: 1068 1211 1321 1639 2092

8.1.6 Remaining variables

Remaining variables

29 Variables   14691 Observations

MCV: Mean corpuscular volume pg
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14649       42      506        1    88.35    6.992     78.2     81.1     84.7     88.3     92.0     95.9     99.0
lowest : 51.0 52.6 54.9 56.3 57.5 , highest: 121.0 121.8 124.6 127.9 128.7

HGB: Haemoglobin G/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14650       41      157        1    11.57    2.558      8.2      8.8      9.9     11.4     13.2     14.6     15.4
lowest : 3.0 3.1 3.5 3.9 4.1 , highest: 19.5 20.5 20.7 20.8 21.0

HCT: Haematocrit %
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14649       42      404        1    34.48    7.316     24.6     26.4     29.8     34.3     39.1     42.9     44.8
lowest : 0.0 0.1 0.2 9.7 9.8 , highest: 61.4 61.9 63.2 65.3 66.6

MCH: Mean corpuscular hemoglobin fl
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14649       42      232        1    29.58    2.693     25.3     26.7     28.4     29.7     31.0     32.4     33.4
lowest : 14.9 15.6 15.9 16.0 16.5 , highest: 42.0 42.3 42.4 42.5 47.4

MCHC: Mean corpuscular hemoglobin concentration g/dl
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14649       42      124    0.999    33.47    1.546     31.1     31.7     32.6     33.5     34.4     35.2     35.6
lowest : 23.7 24.4 24.8 25.1 26.1 , highest: 38.3 38.4 38.9 39.3 43.5

RDW: Red blood cell distribution width %
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14635       56      173        1       15    2.385     12.4     12.7     13.4     14.5     16.0     18.0     19.5
lowest : 10.6 11.1 11.2 11.3 11.4 , highest: 28.6 28.9 29.1 29.7 31.8

MPV: Mean platelet volume fl
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    13989      702       71    0.999    10.38    1.132      8.9      9.2      9.7     10.3     11.0     11.7     12.2
lowest : 7.3 7.7 7.8 7.9 8.0 , highest: 14.2 14.3 14.6 14.8 15.0

NT: Normotest %
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    12224     2467      149        1    83.22    30.56       35       48       67       83      101      118      128
lowest : 4 5 6 7 8 , highest: 148 149 150 151 152

APTT: Activated partial thromboplastin time sec
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    12142     2549      631        1    40.06    9.533     30.1     31.4     34.1     37.7     42.7     49.9     56.6
lowest : 21.4 21.6 23.4 23.5 23.6 , highest: 160.7 163.0 168.7 171.6 176.1

SODIUM: Sodium mmol/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    13409     1282       58    0.994    137.2    5.034      129      132      135      137      140      142      144
lowest : 106 108 109 110 112 , highest: 161 165 166 168 170

CA: Calcium mmol/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    13415     1276      185        1    2.214   0.2213     1.89     1.96     2.09     2.22     2.35     2.45     2.51
lowest : 1.03 1.15 1.18 1.20 1.23 , highest: 3.84 3.88 3.96 4.18 4.40

PHOS: Phosphate mmol/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    13449     1242      306        1    1.048   0.3993     0.55     0.64     0.81     0.99     1.20     1.47     1.74
lowest : 0.30 0.31 0.32 0.33 0.34 , highest: 4.36 4.43 4.53 5.48 6.22

MG: Magnesium mmol/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    12822     1869      146    0.999   0.8136   0.1609     0.59     0.64     0.72     0.81     0.89     0.98     1.06
lowest : 0.20 0.21 0.22 0.26 0.28 , highest: 1.83 1.88 1.96 2.07 2.22

HS: Uric acid mg/dl
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    11630     3061      169        1    5.413    2.625      2.2      2.7      3.7      5.0      6.6      8.5     10.0
lowest : 1.3 1.4 1.5 1.6 1.7 , highest: 19.8 20.2 22.2 22.3 22.7

GBIL: Bilirubin mg/dl
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    13250     1441      885        1    1.406    1.477     0.33     0.39     0.53     0.77     1.23     2.34     3.96
lowest : 0.11 0.12 0.13 0.14 0.15 , highest: 42.82 43.83 45.10 51.72 51.77

TP: Total protein G/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    13108     1583      649        1     64.9    12.97    45.20    49.47    56.90    65.70    73.30    78.80    82.00
lowest : 29.9 30.0 30.3 30.5 30.6 , highest: 107.8 108.1 108.7 112.8 120.9

ALB: Albumin G/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    13015     1676      401        1    33.42    8.513     21.3     23.6     27.9     33.6     39.1     43.2     45.2
lowest : 10.0 10.2 10.5 10.6 10.7 , highest: 52.9 53.2 53.7 54.0 55.7

AMY: Amylase U/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    10778     3913      488        1    90.83    100.5       18       23       33       49       76      125      187
lowest : 8 9 10 11 12 , highest: 4984 5248 40372 43970 56146
 Value          0   500  1000  1500  2000  2500  4000  4500  5000 40500 44000 56000
 Frequency  10432   268    39    14    12     4     2     2     2     1     1     1
 Proportion 0.968 0.025 0.004 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000
For the frequency table, variable is rounded to the nearest 500

PAMY: Pancreas amylase U/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     7577     7114      280    0.999    41.66    47.28        7        9       14       22       36       64       97
lowest : 1 2 3 4 5 , highest: 1673 2083 2116 3066 38369
 Value          0   500  1000  1500  2000  3000 38500
 Frequency   7495    65     7     6     2     1     1
 Proportion 0.989 0.009 0.001 0.001 0.000 0.000 0.000
For the frequency table, variable is rounded to the nearest 500

LIP: Lipases U/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    10992     3699      444        1    63.82    89.88        6        8       14       23       40       79      135
lowest : 0 1 2 3 4 , highest: 11469 15843 18560 22339 45991

CHE: Cholinesterase kU/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    12244     2447      997        1     4.79    2.378     1.70     2.17     3.15     4.60     6.22     7.65     8.49
lowest : 0.98 0.99 1.00 1.01 1.02 , highest: 12.39 12.55 12.97 13.32 13.89

AP: Alkaline phosphatase U/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    13291     1400      672        1    118.8    91.51       42       49       63       84      123      206      302
lowest : 11 14 15 16 17 , highest: 1980 2132 2549 2596 2995

LDH: Lactate dehydrogenase U/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    12977     1714     1137        1    331.2    240.9      136      152      187      239      332      508      724
lowest : 39 46 54 55 56 , highest: 10473 10784 10822 11246 13906

CK: Creatinine kinases U/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    12611     2080     1506        1      385    615.4       18       25       42       80      184      577     1155
lowest : 8 9 10 11 12 , highest: 60799 63011 82180 83880 98801

GLU: Glucoses mg/dl
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    10499     4192      389        1    126.4     48.3       78       85       97      113      138      177      216
lowest : 19 22 23 26 28 , highest: 843 848 890 1349 1403

TRIG: Triclyceride mg/dl
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     9630     5061      538        1    141.7    90.33       54       64       83      115      165      241      307
lowest : 14 15 16 20 22 , highest: 1796 2247 2662 2918 5440

CHOL: Cholesterol mg/dl
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     9646     5045      339        1    150.8    59.23       74       89      113      145      182      219      243
lowest : 25 26 27 28 29 , highest: 646 662 676 710 1104

PDW: Platelet distribution width %
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    13589     1102      167        1    12.29    2.375      9.3      9.8     10.8     12.0     13.4     15.1     16.4
lowest : 6.6 6.8 6.9 7.0 7.1 , highest: 24.1 24.7 24.9 25.2 25.3

RBC: Red blood count T/L
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
    14230      461       65    0.999    3.936   0.8772      2.7      2.9      3.4      3.9      4.5      4.9      5.2
lowest : 1.0 1.1 1.2 1.3 1.4 , highest: 7.2 7.4 7.6 7.7 8.2

8.2 Categorical variables

We now provide a closer visual examination of the categorical predictors.

8.3 Continuous variables

8.3.1 Suggested transformations

Next we investigate whether transforming continuous variables could benefit further analyses by reducing the disproportionate impact of highly influential points, including in multivariate summaries. For this purpose we use the function ida_trans, which optimises the parameter sigma of the pseudo-logarithm. The optimisation seeks the highest possible linear correlation of the transformed values with normal deviates. If no transformation improves on the original scale, none is suggested.
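To make the idea concrete, here is a minimal sketch of such a search. It is not the implementation of ida_trans or pseudo_log, which is not shown in this section. It assumes the pseudo-logarithm has the form asinh(x / (2 * sigma)) / log(base), as in scales::pseudo_log_trans, and it uses a simple grid search over sigma. The names `pseudo_log_sketch`, `normal_cor` and `choose_sigma_sketch` are hypothetical.

```r
# Assumed form of the pseudo-logarithm (as in scales::pseudo_log_trans):
# close to linear near zero, close to log(x / sigma, base) for large x.
pseudo_log_sketch <- function(x, sigma = 1, base = 10) {
  asinh(x / (2 * sigma)) / log(base)
}

# Linear correlation between the sorted values and normal quantiles
# (the correlation underlying a normal Q-Q plot).
normal_cor <- function(x) {
  x <- sort(x[!is.na(x)])
  cor(x, qnorm(ppoints(length(x))))
}

# Grid search: return the sigma with the highest correlation,
# or NA if no candidate beats the untransformed variable.
choose_sigma_sketch <- function(x, sigmas = 2^seq(-10, 10, by = 1)) {
  cor_original <- normal_cor(x)
  cor_transformed <- vapply(
    sigmas,
    function(s) normal_cor(pseudo_log_sketch(x, sigma = s)),
    numeric(1)
  )
  if (max(cor_transformed) > cor_original) {
    sigmas[which.max(cor_transformed)]
  } else {
    NA_real_
  }
}

# Simulated right-skewed variable; a transformation is expected to help here.
set.seed(1)
skewed <- rlnorm(500, meanlog = 2, sdlog = 1)
sigma_skewed <- choose_sigma_sketch(skewed)
sigma_skewed
```

A grid is used only to keep the sketch short; a numerical optimiser over sigma would serve the same purpose.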

variables <- c("AGE", structural_vars, key_predictors, leuko_related_vars, leuko_ratio_vars, kidney_related_vars, acute_related_vars, remaining_vars)
unique.variables <- unique(variables)

# ida_trans() takes long to run; consider calculating res once and saving it
res <- sapply(unique.variables, function(X) ida_trans(b_bact[, X])$const)

res
       AGE        WBC        SEX        BUN       CREA        NEU        PLT 
        NA 2.14364604         NA 0.03198339 0.03193846         NA         NA 
       EOS       BASO        LYM       MONO       NEUR       EOSR      BASOR 
0.12561255 0.13215999 0.17979933 0.23679427         NA 0.47139320 0.19315481 
      LYMR      MONOR     POTASS       eGFR   BUN_CREA        FIB        CRP 
1.70910135 3.11197362         NA         NA 0.01953382         NA         NA 
      ASAT       ALAT        GGT        MCV        HGB        HCT        MCH 
0.02818736 1.01761807 0.02827878         NA         NA         NA         NA 
      MCHC        RDW        MPV         NT       APTT     SODIUM         CA 
        NA         NA         NA         NA 0.03047767         NA         NA 
      PHOS         MG         HS       GBIL         TP        ALB        AMY 
0.12526560         NA         NA 0.03306450         NA         NA 0.01844397 
      PAMY        LIP        CHE         AP        LDH         CK        GLU 
0.03036179 1.02765958         NA 0.02384583 0.03166182 0.03282045 0.02766430 
      TRIG       CHOL        PDW        RBC 
0.03242708         NA         NA         NA 
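
Since the code comment notes that this step is slow, one option is to cache the result on disk. The following is a minimal sketch under that assumption. The helper `cached` and the stand-in `slow_search` are hypothetical and not part of the report's code, and a temporary file stands in for a project path.

```r
# Hypothetical helper: compute a value once, then reuse the saved copy.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

# Stand-in for the slow sigma search; it counts how often it actually runs.
n_runs <- 0
slow_search <- function() {
  n_runs <<- n_runs + 1
  c(WBC = 2.14, AGE = NA)
}

cache_file <- tempfile(fileext = ".rds")
res_first <- cached(cache_file, slow_search)   # runs the search and saves it
res_second <- cached(cache_file, slow_search)  # reads the saved result
```

The cache file has to be deleted whenever the data or the variable list changes, otherwise stale sigma values would be reused.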

Register transformed variables in the data set:

for (j in seq_along(unique.variables)) {
  if (!is.na(res[j])) {
    newname <- paste0("t_", unique.variables[j])
    newlabel <- paste("pseudo-log of", label(b_bact)[unique.variables[j]])
    names(newlabel) <- newname
    x <- pseudo_log(b_bact[[unique.variables[j]]], sigma = res[j], base = 10)
    label(x) <- newlabel
    b_bact[[newname]] <- x
    # note: the return value of upData() is not assigned, so this call only
    # prints the object sizes; the label is already set via label(x) above
    upData(b_bact, labels = newlabel)
  }
}
Input object size:   5575040 bytes;  57 variables    14691 observations
New object size:    5574816 bytes;  57 variables    14691 observations
...
Input object size:   8541216 bytes;  82 variables    14691 observations
New object size:    8540992 bytes;  82 variables    14691 observations

sigma_values <- res

c_bact <- b_bact

# update variable lists: generate a second list in which transformed variables replace the originals
bact_transformed <- bact_variables

for (j in seq_along(bact_variables)) {
  for (jj in seq_along(bact_variables[[j]])) {
    if (!is.na(res[bact_variables[[j]][jj]])) {
      bact_transformed[[j]][jj] <- paste0("t_", bact_variables[[j]][jj])
    }
  }
}
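
The same renaming can be written without explicit loops. The sketch below uses made-up toy inputs: `sigma_toy`, `groups_toy` and `rename_transformed` are hypothetical names, not objects from this report. It relies on the fact that indexing a named vector by an unknown name yields NA, so variables without a suggested sigma keep their original name.

```r
# Toy stand-ins; names and values are illustrative only.
sigma_toy <- c(AGE = NA, WBC = 2.14, CREA = 0.03)
groups_toy <- list(
  demog_vars = c("AGE"),
  key_predictors = c("WBC", "CREA", "PLT")  # PLT has no entry in sigma_toy
)

# Prefix every variable that has a non-missing sigma; leave the others as is.
rename_transformed <- function(groups, sigmas, prefix = "t_") {
  lapply(groups, function(vars) {
    has_sigma <- unname(!is.na(sigmas[vars]))
    ifelse(has_sigma, paste0(prefix, vars), vars)
  })
}

groups_transformed <- rename_transformed(groups_toy, sigma_toy)
```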

8.3.2 Univariate distributions of the original variables and their suggested transformations

for (j in seq_along(unique.variables)) {
  print(ida_plot_univar(b_bact, unique.variables[j], sigma = res[j], n_bars = 100))
}

Warning: Removed 1 rows containing missing values (geom_bar).


8.3.3 Comparison of univariate distributions with and without pseudo-log transformation

The comparison is shown only for variables for which a transformation is suggested.

for (j in seq_along(unique.variables)) {
  if (!is.na(res[j])) {
    print(ida_plot_univar_orig_vs_trans(b_bact, unique.variables[j], sigma = res[j], n_bars = 100))
  }
}

Warning: Removed 1 rows containing missing values (geom_bar).
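
For readers without access to the plotting helpers, a comparison of this kind can be approximated with base graphics. This is a minimal sketch on simulated data, not the code of ida_plot_univar_orig_vs_trans. `pseudo_log_sketch` and `compare_hist_sketch` are hypothetical names, and the pseudo-logarithm is assumed to have the asinh form used by scales::pseudo_log_trans.

```r
# Assumed form of the pseudo-logarithm (as in scales::pseudo_log_trans).
pseudo_log_sketch <- function(x, sigma = 1, base = 10) {
  asinh(x / (2 * sigma)) / log(base)
}

# Draw the original and the transformed histogram side by side
# and return both histogram objects invisibly.
compare_hist_sketch <- function(x, sigma, n_bars = 100, label = "x") {
  x <- x[!is.na(x)]
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))
  h_orig <- hist(x, breaks = n_bars,
                 main = paste(label, "(original)"), xlab = label)
  h_trans <- hist(pseudo_log_sketch(x, sigma), breaks = n_bars,
                  main = paste(label, "(pseudo-log)"), xlab = paste0("t_", label))
  invisible(list(original = h_orig, transformed = h_trans))
}

# Simulated right-skewed variable with one missing value.
set.seed(42)
toy_values <- c(rlnorm(1000, meanlog = 1, sdlog = 1.2), NA)

pdf(NULL)  # draw to a null device so the sketch also runs non-interactively
panels <- compare_hist_sketch(toy_values, sigma = 0.03, label = "toy variable")
dev.off()
```

Note that `breaks = n_bars` is only a suggestion to hist(), so the number of bars drawn may differ slightly from the requested value.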

save(list=c("c_bact", "bact_variables", "sigma_values", "bact_transformed"), 
     file=here::here("data", "bact_env_c.rda"))

8.3.4 Univariate distributions of the original variables without the suggested transformations

for (j in seq_along(unique.variables)) {
  print(ida_plot_univar(b_bact, unique.variables[j], sigma = res[j], n_bars = 100, transform = FALSE))
}

Warning: Removed 1 rows containing missing values (geom_bar).


8.4 Section session info

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] Hmisc_4.6-0     Formula_1.2-4   survival_3.2-13 lattice_0.20-45
 [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4    
 [9] readr_2.1.1     tidyr_1.2.0     tibble_3.1.7    ggplot2_3.3.6  
[13] tidyverse_1.3.1 here_1.0.1     

loaded via a namespace (and not attached):
 [1] httr_1.4.2          jsonlite_1.7.2      splines_4.1.2      
 [4] modelr_0.1.8        assertthat_0.2.1    latticeExtra_0.6-29
 [7] renv_0.15.5         cellranger_1.1.0    pillar_1.7.0       
[10] backports_1.4.1     glue_1.6.2          digest_0.6.29      
[13] checkmate_2.1.0     RColorBrewer_1.1-3  rvest_1.0.2        
[16] colorspace_2.0-3    htmltools_0.5.2     Matrix_1.3-4       
[19] pkgconfig_2.0.3     broom_0.8.0         haven_2.4.3        
[22] patchwork_1.1.1     scales_1.2.0        jpeg_0.1-9         
[25] tzdb_0.2.0          htmlTable_2.3.0     farver_2.1.0       
[28] generics_0.1.2      ellipsis_0.3.2      withr_2.5.0        
[31] nnet_7.3-16         cli_3.3.0           magrittr_2.0.3     
[34] crayon_1.5.1        readxl_1.3.1        evaluate_0.14      
[37] fs_1.5.2            fansi_1.0.3         xml2_1.3.3         
[40] foreign_0.8-81      data.table_1.14.2   tools_4.1.2        
[43] hms_1.1.1           lifecycle_1.0.1     munsell_0.5.0      
[46] reprex_2.0.1        cluster_2.1.2       compiler_4.1.2     
[49] rlang_1.0.3         grid_4.1.2          rstudioapi_0.13    
[52] htmlwidgets_1.5.4   labeling_0.4.2      base64enc_0.1-3    
[55] rmarkdown_2.11      gtable_0.3.0        DBI_1.1.2          
[58] R6_2.5.1            gridExtra_2.3       lubridate_1.8.0    
[61] knitr_1.37          fastmap_1.1.0       utf8_1.2.2         
[64] rprojroot_2.0.2     stringi_1.7.6       Rcpp_1.0.8.3       
[67] vctrs_0.4.1         rpart_4.1-15        png_0.1-7          
[70] dbplyr_2.1.1        tidyselect_1.1.2    xfun_0.31