Chapter 11 Missing data

11.1 Per variable missingness

Number and percentage of missing values for each variable.

Variable Missing (count) Missing (%)
alcohol 395 6.61
sys 274 4.59
lbxtc 230 3.85
lbdhdd 230 3.85
bmi 44 0.74
educationadult 4 0.07
smokecigs 2 0.03
age 0 0.00
gender 0 0.00
diabetes 0 0.00
chf 0 0.00
cancer 0 0.00
stroke 0 0.00
tac 0 0.00
tlac 0 0.00
mvpa 0 0.00
wt 0 0.00
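A table of this kind could be computed along the following lines. This is a minimal base R sketch, not the code used for the report: the data frame `dat` and its values are made up for illustration, and only the column names are borrowed from the table above.

```r
# Toy data frame standing in for the analysis data (values are made up).
dat <- data.frame(
  age     = c(34, 51, 67, 45, 72),
  bmi     = c(24.1, NA, 31.5, 28.0, NA),
  alcohol = c(1, 2, NA, 1, 3)
)

# Count and percentage of missing values per variable.
n_miss <- colSums(is.na(dat))
miss_tab <- data.frame(
  variable      = names(n_miss),
  missing_count = as.integer(n_miss),
  missing_pct   = round(100 * as.numeric(n_miss) / nrow(dat), 2)
)

# Sort by decreasing missingness, as in the table above.
miss_tab <- miss_tab[order(-miss_tab$missing_count), ]
rownames(miss_tab) <- NULL
print(miss_tab)
```

Percentages are taken relative to the number of rows, so they describe the share of participants with a missing value in each variable.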

Number and percentage of participants with at least one missing value within each group of variables:

Variable Missing (count) Missing (%)
any_variable_missing 768 12.86
any_lab_missing 470 7.87
any_lifestyle_missing 431 7.22
any_demographics_missing 4 0.07
any_vip_missing 0 0.00
any_health_missing 0 0.00
any_physact_missing 0 0.00

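Overall, 13% of participants (768 of 5972) have at least one missing value and would be excluded from a complete-case model that includes all variables. There is no missingness in the VIPs, 8% of participants are missing at least one laboratory variable, and 7% are missing at least one lifestyle variable such as smoking or alcohol consumption.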
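Group-level indicators such as `any_lab_missing` can be derived by checking, for each participant, whether any variable in the group is missing. The sketch below uses base R on made-up data. The assignment of variables to groups in `groups` is an assumption for illustration and may differ from the grouping used in the report.

```r
# Toy data (values are made up); column names follow the report.
dat <- data.frame(
  sys       = c(120, NA, 135, 128),
  lbxtc     = c(190, 210, NA, 180),
  smokecigs = c("Never", "Former", "Current", NA),
  age       = c(40, 55, 63, 71)
)

# Hypothetical grouping of variables (an assumption, not taken from the report).
groups <- list(
  lab          = c("sys", "lbxtc"),
  lifestyle    = "smokecigs",
  demographics = "age"
)

# One logical column per group: TRUE if any variable in the group is missing.
any_missing <- sapply(groups, function(vars) {
  rowSums(is.na(dat[, vars, drop = FALSE])) > 0
})

# Indicator across all variables.
any_variable <- rowSums(is.na(dat)) > 0

group_tab <- data.frame(
  indicator     = paste0("any_", colnames(any_missing), "_missing"),
  missing_count = as.integer(colSums(any_missing)),
  missing_pct   = round(100 * as.numeric(colMeans(any_missing)), 2)
)
print(group_tab)
print(sum(any_variable))
```

Because a participant can have missing values in several groups, the group counts need not add up to the count for `any_variable`.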

11.2 Variable summaries for complete vs incomplete cases

Table 11.1: Participant characteristics by missing status
complete (N=5204) incomplete (N=768) Total (N=5972) p value
age 0.005
   Median 54.167 50.417 53.750
   Q1, Q3 41.917, 67.417 40.167, 66.167 41.646, 67.250
   Range 30.000 - 84.917 30.000 - 84.917 30.000 - 84.917
gender < 0.001
   Male 2601 (50.0%) 334 (43.5%) 2935 (49.1%)
   Female 2603 (50.0%) 434 (56.5%) 3037 (50.9%)
education level < 0.001
   N-Miss 0 4 4
   Less than high school 1431 (27.5%) 252 (33.0%) 1683 (28.2%)
   High school 1251 (24.0%) 197 (25.8%) 1448 (24.3%)
   More than high school 2522 (48.5%) 315 (41.2%) 2837 (47.5%)
diabetes 0.092
   No 4558 (87.6%) 656 (85.4%) 5214 (87.3%)
   Yes 646 (12.4%) 112 (14.6%) 758 (12.7%)
congestive heart failure 0.071
   No 5010 (96.3%) 729 (94.9%) 5739 (96.1%)
   Yes 194 (3.7%) 39 (5.1%) 233 (3.9%)
cancer 0.458
   No 4664 (89.6%) 695 (90.5%) 5359 (89.7%)
   Yes 540 (10.4%) 73 (9.5%) 613 (10.3%)
stroke 0.206
   No 5003 (96.1%) 731 (95.2%) 5734 (96.0%)
   Yes 201 (3.9%) 37 (4.8%) 238 (4.0%)
body mass index 0.003
   Median 28.060 28.290 28.080
   Q1, Q3 24.740, 32.180 24.432, 32.928 24.730, 32.230
   Range 13.360 - 130.210 14.650 - 63.420 13.360 - 130.210
smoking status 0.006
   N-Miss 0 2 2
   Never 2514 (48.3%) 397 (51.8%) 2911 (48.8%)
   Former 1571 (30.2%) 188 (24.5%) 1759 (29.5%)
   Current 1119 (21.5%) 181 (23.6%) 1300 (21.8%)
alcohol consumption 0.153
   Median 1.000 2.000 1.000
   Q1, Q3 1.000, 2.000 1.000, 2.000 1.000, 2.000
   Range 1.000 - 3.000 1.000 - 3.000 1.000 - 3.000
total log activity count (log(1+activity)) < 0.001
   Median 2925.685 2796.018 2910.926
   Q1, Q3 2401.400, 3440.724 2239.750, 3380.564 2384.757, 3430.648
   Range 313.083 - 6122.678 466.036 - 5102.369 313.083 - 6122.678
total accelerometer wear time 0.017
   Median 854.310 832.100 852.071
   Q1, Q3 785.000, 923.808 762.458, 909.571 782.851, 922.036
   Range 602.000 - 1440.000 600.000 - 1440.000 600.000 - 1440.000
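A comparison of complete and incomplete cases like Table 11.1 starts from a complete-case indicator. The base R sketch below uses simulated data and shows one plausible choice of tests: a Wilcoxon rank-sum test for a continuous variable and a chi-squared test for a categorical one. The source does not state which tests produced the p values in the table, so these choices are assumptions.

```r
set.seed(1)

# Simulated data (not NHANES); 25 BMI values are set to missing at random.
n <- 200
dat <- data.frame(
  age    = runif(n, 30, 85),
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  bmi    = rnorm(n, mean = 28, sd = 5)
)
dat$bmi[sample(n, 25)] <- NA

# Complete-case indicator: a row is complete if no variable is missing.
dat$status <- factor(ifelse(complete.cases(dat), "complete", "incomplete"))
print(table(dat$status))

# Continuous variable: medians by status and a rank-based test (assumed choice).
print(tapply(dat$age, dat$status, median))
p_age <- wilcox.test(age ~ status, data = dat)$p.value

# Categorical variable: cross-tabulation and chi-squared test (assumed choice).
gender_tab <- table(dat$gender, dat$status)
print(gender_tab)
p_gender <- chisq.test(gender_tab)$p.value
```

A small p value for a variable indicates that complete and incomplete cases differ in that variable, which argues against assuming that values are missing completely at random.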

11.3 Missingness patterns over variables

Missing values for each participant in the NHANES dataset are shown in the following figure, where the black lines correspond to observations with missing values.

There are 7 independent variables with missing values in the dataset.

Alcohol consumption and the physiological variables (blood pressure, cholesterol) have the highest proportions of missingness. There does not seem to be a pattern of missingness across variables, other than for the two cholesterol variables (total, HDL), which tend to be missing together.
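Whether two variables tend to be missing together can also be checked numerically with a co-occurrence matrix of missingness indicators. This is a small base R sketch on made-up values. The column names follow the report, but the data do not.

```r
# Toy data (values are made up): the two cholesterol variables are
# missing in the same rows, systolic blood pressure in a different row.
dat <- data.frame(
  lbxtc  = c(190, NA, 210, NA, 180, 200),
  lbdhdd = c(50, NA, 60, NA, 45, 55),
  sys    = c(120, 130, NA, 125, 118, 140)
)

# 0/1 matrix of missingness indicators (1 = missing).
miss <- 1 * is.na(dat)

# Entry [i, j] counts rows where variables i and j are both missing;
# the diagonal holds the per-variable missing counts.
co_miss <- crossprod(miss)
print(co_miss)
```

An off-diagonal entry close to the corresponding diagonal entries suggests that the two variables are usually missing for the same participants.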

In addition, we can explore the missing data mechanism for BMI and systolic blood pressure by displaying both variables, together with their missing values, in the same scatterplot:

Missing values in one variable occur across the whole range of observed values of the other variable. There is an extreme BMI value for a male participant that is likely a data entry error.
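A plot of this type can be approximated in base R by drawing the fully observed pairs as points and marking, on each axis, the observed values whose partner is missing. This is only a sketch on made-up data. The report's figure may have been produced differently, for example with one of the packages listed in the session info.

```r
# Toy data (values are made up).
dat <- data.frame(
  bmi = c(22, 27, NA, 31, 35, NA, 29),
  sys = c(110, NA, 125, 140, NA, 118, 132)
)

both <- complete.cases(dat)

# Observed values of one variable where the other is missing.
bmi_when_sys_missing <- dat$bmi[is.na(dat$sys) & !is.na(dat$bmi)]
sys_when_bmi_missing <- dat$sys[is.na(dat$bmi) & !is.na(dat$sys)]

# Scatterplot of complete pairs; axis limits cover all observed values.
plot(dat$bmi[both], dat$sys[both],
     xlim = range(dat$bmi, na.rm = TRUE),
     ylim = range(dat$sys, na.rm = TRUE),
     xlab = "Body mass index", ylab = "Systolic blood pressure")

# Red ticks on the axes mark values whose partner variable is missing.
rug(bmi_when_sys_missing, side = 1, col = "red")
rug(sys_when_bmi_missing, side = 2, col = "red")
```

If the red ticks are spread over the same range as the plotted points, missingness in one variable does not appear to depend on the value of the other.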

11.4 (In)complete cases

This section presents patients with at least one missing value. First we list these patients in a filterable table.

Then we report the patterns of missingness for this set of patients.
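Both steps can be sketched in base R: select the rows with at least one missing value, then label each row by the set of variables it is missing and count the labels. The data frame, the `id` column, and the values below are hypothetical.

```r
# Toy data (values are made up); `id` is a hypothetical patient identifier.
dat <- data.frame(
  id      = 1:6,
  bmi     = c(24, NA, 30, 27, NA, 33),
  sys     = c(120, 130, NA, 125, NA, 140),
  alcohol = c(1, 2, 1, NA, NA, 3)
)
vars <- setdiff(names(dat), "id")

# Patients with at least one missing value.
incomplete <- dat[!complete.cases(dat[, vars]), ]
print(incomplete)

# Missingness pattern per patient: names of the missing variables.
pattern <- apply(is.na(incomplete[, vars]), 1, function(r) {
  paste(vars[r], collapse = " + ")
})

# Frequency of each pattern, most common first.
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)
```

Patterns that involve several variables at once, such as `bmi + sys + alcohol` in the toy data, point to patients who missed a whole part of the data collection rather than a single measurement.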

11.5 Section session info

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Austria.1252  LC_CTYPE=English_Austria.1252   
## [3] LC_MONETARY=English_Austria.1252 LC_NUMERIC=C                    
## [5] LC_TIME=English_Austria.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] VIM_6.1.1        colorspace_2.0-3 arsenal_3.6.3    DT_0.22         
##  [5] kableExtra_1.3.4 gt_0.4.0         naniar_0.6.1     Hmisc_4.6-0     
##  [9] Formula_1.2-4    survival_3.2-13  lattice_0.20-45  forcats_0.5.1   
## [13] stringr_1.4.0    dplyr_1.0.8      purrr_0.3.4      readr_2.1.2     
## [17] tidyr_1.2.0      tibble_3.1.6     ggplot2_3.3.5    tidyverse_1.3.1 
## [21] here_1.0.1      
## 
## loaded via a namespace (and not attached):
##  [1] ellipsis_0.3.2      class_7.3-20        visdat_0.5.3       
##  [4] rprojroot_2.0.2     htmlTable_2.4.0     base64enc_0.1-3    
##  [7] fs_1.5.2            rstudioapi_0.13     proxy_0.4-26       
## [10] farver_2.1.0        fansi_1.0.3         lubridate_1.8.0    
## [13] ranger_0.13.1       xml2_1.3.3          splines_4.1.3      
## [16] robustbase_0.95-0   knitr_1.38          jsonlite_1.8.0     
## [19] broom_0.7.12        cluster_2.1.2       dbplyr_2.1.1       
## [22] png_0.1-7           compiler_4.1.3      httr_1.4.2         
## [25] backports_1.4.1     assertthat_0.2.1    Matrix_1.4-0       
## [28] fastmap_1.1.0       cli_3.2.0           htmltools_0.5.2    
## [31] tools_4.1.3         gtable_0.3.0        glue_1.6.2         
## [34] Rcpp_1.0.8.3        carData_3.0-5       cellranger_1.1.0   
## [37] jquerylib_0.1.4     vctrs_0.3.8         svglite_2.1.0      
## [40] crosstalk_1.2.0     lmtest_0.9-40       xfun_0.30          
## [43] laeken_0.5.2        rvest_1.0.2         lifecycle_1.0.1    
## [46] DEoptimR_1.0-11     zoo_1.8-9           MASS_7.3-55        
## [49] scales_1.1.1        hms_1.1.1           RColorBrewer_1.1-2 
## [52] yaml_2.3.5          gridExtra_2.3       UpSetR_1.4.0       
## [55] sass_0.4.1          rpart_4.1.16        latticeExtra_0.6-29
## [58] stringi_1.7.6       highr_0.9           e1071_1.7-9        
## [61] checkmate_2.0.0     boot_1.3-28         commonmark_1.8.0   
## [64] rlang_1.0.2         pkgconfig_2.0.3     systemfonts_1.0.4  
## [67] evaluate_0.15       labeling_0.4.2      htmlwidgets_1.5.4  
## [70] tidyselect_1.1.2    plyr_1.8.7          magrittr_2.0.2     
## [73] bookdown_0.25       R6_2.5.1            generics_0.1.2     
## [76] DBI_1.1.2           pillar_1.7.0        haven_2.4.3        
## [79] foreign_0.8-82      withr_2.5.0         abind_1.4-5        
## [82] sp_1.4-6            nnet_7.3-17         modelr_0.1.8       
## [85] crayon_1.5.1        car_3.0-12          utf8_1.2.2         
## [88] tzdb_0.2.0          rmarkdown_2.13      jpeg_0.1-9         
## [91] readxl_1.3.1        data.table_1.14.2   vcd_1.4-9          
## [94] reprex_2.0.1        digest_0.6.29       webshot_0.5.2      
## [97] munsell_0.5.0       viridisLite_0.4.0   bslib_0.3.1