Chapter 16 Missing data

16.1 Per variable missingness

Number and percentage of missing values per variable:

Variable Missing (count) Missing (%)
PAMY 7114 48.42
TRIG 5061 34.45
CHOL 5045 34.34
GLU 4192 28.53
AMY 3913 26.64
LIP 3699 25.18
HS 3061 20.84
FIB 2567 17.47
APTT 2549 17.35
NT 2467 16.79
CHE 2447 16.66
CK 2080 14.16
K 2008 13.67
MG 1869 12.72
LDH 1714 11.67
ALB 1676 11.41
TP 1583 10.78
GBIL 1441 9.81
AP 1400 9.53
NA. 1282 8.73
CA 1276 8.69
GGT 1262 8.59
PHOS 1242 8.45
ASAT 1154 7.86
PDW 1102 7.50
ALAT 987 6.72
BASOR 732 4.98
EOSR 732 4.98
LYMR 732 4.98
MONOR 732 4.98
NEUR 732 4.98
NEU 728 4.96
MPV 702 4.78
WBC 462 3.14
RBC 461 3.14
LYM 262 1.78
MONO 246 1.67
BUN_KREA 174 1.18
BUN 172 1.17
KREA 159 1.08
eGFR 159 1.08
CRP 155 1.06
BASO 146 0.99
EOS 135 0.92
RDW 56 0.38
MCV 42 0.29
HCT 42 0.29
PLT 42 0.29
MCH 42 0.29
MCHC 42 0.29
HGB 41 0.28
sex 0 0.00
Alter 0 0.00
BloodCulture 0 0.00
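A table of this kind can be computed with a few lines of base R. The following is a minimal sketch, not the code used for the report: the data frame `labs` and its values are hypothetical toy data, and only the column names are borrowed from the table above.

```r
# Hypothetical toy data; the real data set is not shown in this chapter
labs <- data.frame(
  sex  = c("m", "f", "f", "m", "f"),
  CHOL = c(180, NA, NA, 210, 195),
  TRIG = c(120, NA, NA, 150, NA),
  HGB  = c(13.5, 14.1, NA, 12.8, 15.0)
)

# Count of missing values per variable
n_missing <- colSums(is.na(labs))

# Assemble count and percentage into one table
miss_tab <- data.frame(
  Variable    = names(n_missing),
  Missing_n   = unname(n_missing),
  Missing_pct = unname(round(100 * n_missing / nrow(labs), 2))
)

# Sort from most to least missing, as in the table above
miss_tab <- miss_tab[order(-miss_tab$Missing_n), ]
print(miss_tab)
```

The naniar package, which is attached in the session info of this chapter, offers `miss_var_summary()` for the same purpose; whether the report used it is not stated.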

Missingness by group of variables (a patient counts as missing for a group if at least one variable in that group is missing):

Variable Missing (count) Missing (%)
Any_Variable_missing 10712 72.92
Any_remaining_missing 10587 72.06
Any_VIP_leuko_kidney_acute_missing 5306 36.12
Any_Acute_missing 3139 21.37
Any_Kidney_missing 2065 14.06
Any_VIP_missing 898 6.11
Any_Leuko_missing 728 4.96
Any_Demographics_missing 0 0.00

From this table we learn that a model restricted to the VIPs or to the leukocyte-related variables loses fewer than 10% of patients to missing values (6.11% and 4.96%, respectively), which may justify a complete-case analysis. Adding the kidney- and acute-phase-related variables raises the proportion of patients with at least one missing value to about 36%, which leads to a substantial loss of power. Multiple imputation may then recover much of that information and may in particular help retain the power of the (otherwise very complete) VIPs.
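The group-level indicators and the resulting complete-case sample sizes can be sketched as follows. The data frame `labs`, its values, and the assignment of variables to groups are assumptions made for illustration; the chapter does not list which variables belong to which group.

```r
# Hypothetical toy data
labs <- data.frame(
  WBC  = c(6.1, NA, 7.4, 5.9, 8.2, 6.6),
  NEU  = c(3.5, NA, 4.1, 3.2, NA, 3.9),
  KREA = c(0.9, 1.1, NA, 1.0, 0.8, 1.2),
  BUN  = c(14, 18, NA, 15, 12, 16),
  CRP  = c(5, 12, 3, NA, 8, 2)
)

# Hypothetical grouping of variables
groups <- list(
  Leuko  = c("WBC", "NEU"),
  Kidney = c("KREA", "BUN"),
  Acute  = c("CRP")
)
# Union of all groups defined so far
groups$Leuko_kidney_acute <- unlist(groups, use.names = FALSE)

# TRUE if a patient has at least one NA among the variables of a group
any_missing <- sapply(groups, function(vars) {
  rowSums(is.na(labs[, vars, drop = FALSE])) > 0
})

group_tab <- data.frame(
  Group       = paste0("Any_", colnames(any_missing), "_missing"),
  Missing_n   = unname(colSums(any_missing)),
  Missing_pct = unname(round(100 * colMeans(any_missing), 2))
)
print(group_tab)

# Patients left for a complete-case analysis using each group
complete_n <- nrow(labs) - group_tab$Missing_n
```

Comparing `complete_n` across groups makes the trade-off in the paragraph above concrete: every added group can only shrink the complete-case sample.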

16.2 Missingness patterns over variables

First we create a dendrogram that shows which variables tend to be missing together:
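One way to obtain such a dendrogram is hierarchical clustering of the missingness indicators. The sketch below uses base R only; the toy data frame `labs`, the binary distance, and average linkage are assumptions, and the original figure may have been produced differently.

```r
# Hypothetical toy data
labs <- data.frame(
  CHOL = c(180, NA, NA, 210, NA, 190, 200, NA),
  TRIG = c(120, NA, NA, 150, NA, 130, 140, NA),
  PAMY = c(30, NA, NA, 25, 28, NA, 27, NA),
  K    = c(4.1, 4.0, NA, 4.3, 3.9, 4.2, NA, 4.4)
)

# Indicator matrix: 1 = missing, 0 = observed
miss <- 1 * is.na(labs)

# Binary distance between variables (columns): small when two
# variables are missing for largely the same patients
d  <- dist(t(miss), method = "binary")
hc <- hclust(d, method = "average")

# Variables that merge at a low height tend to be missing together
plot(hc, main = "Co-missingness of variables", xlab = "", sub = "")
```

In this toy example CHOL and TRIG have identical missingness patterns, so they merge first at height 0.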

Furthermore, for the variables that are missing in more than 10% of cases, we create a heatmap that simultaneously shows the clusters of patients with missing values and the clusters of variables:

In this heatmap, we see that CHOL and TRIG are almost always missing together (lowest level of the dendrogram; their missing counts in Section 16.1 differ by only 16 patients), but there are no further such pairs among the other variables. There is also some evidence that PAMY tends to be missing when CHOL and TRIG are missing, although this does not hold for a small proportion of patients. The large white area in the middle of the heatmap represents the approximately 30% of patients with no missing values in those variables.
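The variable selection, the heatmap, and the co-missingness statements above can be checked numerically along the following lines. This is a sketch on hypothetical toy data (`labs` and its values are invented); base R `heatmap()` with its default clustering stands in for whatever plotting function the report used.

```r
# Hypothetical toy data
labs <- data.frame(
  CHOL = c(180, NA, NA, 210, NA, 190, 200, NA, 185, 205),
  TRIG = c(120, NA, NA, 150, NA, 130, 140, NA, 125, 145),
  PAMY = c(30, NA, NA, 25, 28, NA, 27, NA, 26, 29),
  HGB  = c(13.5, 14.1, 13.9, 12.8, 15.0, 14.4, 13.2, 14.8, NA, 13.7)
)

miss <- is.na(labs)

# Keep only variables missing in more than 10% of patients
keep     <- colMeans(miss) > 0.10
miss_sel <- miss[, keep, drop = FALSE]

# Cross-tabulation: empty off-diagonal cells mean that CHOL and
# TRIG are always missing together
co_tab <- table(CHOL_missing = miss[, "CHOL"],
                TRIG_missing = miss[, "TRIG"])
print(co_tab)

# Share of patients with PAMY missing among those with both
# CHOL and TRIG missing
both <- miss[, "CHOL"] & miss[, "TRIG"]
pamy_given_both <- mean(miss[both, "PAMY"])

# Heatmap of the indicators: rows = patients, columns = variables,
# black = missing, white = observed; both axes are clustered
heatmap(1 * miss_sel, scale = "none",
        col = c("white", "black"), labRow = NA)
```

With the real data, the off-diagonal cells of `co_tab` would show how many patients break the CHOL/TRIG pairing, and `pamy_given_both` would quantify the "some evidence" for PAMY.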

16.3 Section session info

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Austria.1252  LC_CTYPE=English_Austria.1252   
## [3] LC_MONETARY=English_Austria.1252 LC_NUMERIC=C                    
## [5] LC_TIME=English_Austria.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.22          kableExtra_1.3.4 gt_0.4.0         naniar_0.6.1    
##  [5] Hmisc_4.6-0      Formula_1.2-4    survival_3.2-13  lattice_0.20-45 
##  [9] forcats_0.5.1    stringr_1.4.0    dplyr_1.0.8      purrr_0.3.4     
## [13] readr_2.1.2      tidyr_1.2.0      tibble_3.1.6     ggplot2_3.3.5   
## [17] tidyverse_1.3.1  here_1.0.1      
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.2            lubridate_1.8.0     webshot_0.5.2      
##  [4] RColorBrewer_1.1-2  httr_1.4.2          rprojroot_2.0.2    
##  [7] tools_4.1.3         backports_1.4.1     bslib_0.3.1        
## [10] utf8_1.2.2          R6_2.5.1            rpart_4.1.16       
## [13] DBI_1.1.2           colorspace_2.0-3    nnet_7.3-17        
## [16] withr_2.5.0         tidyselect_1.1.2    gridExtra_2.3      
## [19] compiler_4.1.3      cli_3.2.0           rvest_1.0.2        
## [22] htmlTable_2.4.0     xml2_1.3.3          bookdown_0.25      
## [25] sass_0.4.1          scales_1.1.1        checkmate_2.0.0    
## [28] commonmark_1.8.0    systemfonts_1.0.4   digest_0.6.29      
## [31] foreign_0.8-82      svglite_2.1.0       rmarkdown_2.13     
## [34] base64enc_0.1-3     jpeg_0.1-9          pkgconfig_2.0.3    
## [37] htmltools_0.5.2     highr_0.9           dbplyr_2.1.1       
## [40] fastmap_1.1.0       htmlwidgets_1.5.4   rlang_1.0.2        
## [43] readxl_1.3.1        rstudioapi_0.13     jquerylib_0.1.4    
## [46] generics_0.1.2      jsonlite_1.8.0      magrittr_2.0.2     
## [49] Matrix_1.4-0        Rcpp_1.0.8.3        munsell_0.5.0      
## [52] fansi_1.0.3         lifecycle_1.0.1     visdat_0.5.3       
## [55] stringi_1.7.6       yaml_2.3.5          grid_4.1.3         
## [58] crayon_1.5.1        haven_2.4.3         splines_4.1.3      
## [61] hms_1.1.1           knitr_1.38          pillar_1.7.0       
## [64] reprex_2.0.1        glue_1.6.2          evaluate_0.15      
## [67] latticeExtra_0.6-29 data.table_1.14.2   modelr_0.1.8       
## [70] png_0.1-7           vctrs_0.3.8         tzdb_0.2.0         
## [73] cellranger_1.1.0    gtable_0.3.0        assertthat_0.2.1   
## [76] xfun_0.30           broom_0.7.12        viridisLite_0.4.0  
## [79] cluster_2.1.2       ellipsis_0.3.2