Chapter 14 Introduction to Bacteremia
To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan, which is described in more detail in the section Bact_SAP.Rmd.
The hypothetical research aim is to develop a multivariable diagnostic model for bacteremia using 49 continuous laboratory blood parameters, age and sex. The primary aim is prediction; the secondary aim is to describe the association of each variable with the outcome (‘explaining’ the multivariable model).
A diagnostic prediction model was developed based on this data set and validated in “A Risk Prediction Model for Screening Bacteremic Patients: A Cross Sectional Study” Ratzinger et al, PLoS One 2014. The assumed research aim is in line with this diagnostic prediction model.
14.1 Dataset Description
Ratzinger et al (2014) performed a diagnostic study in which age, sex and 49 laboratory variables were used to diagnose the bacteremia status of a blood sample with a multivariable model. Between January 2006 and December 2010, patients with a clinical suspicion of bacteremia were included if the responsible physician requested a blood culture analysis and blood was sampled for assessment of haematology and biochemistry. The data consist of 14,691 observations from different patients.
Our version of the data was slightly modified compared to the original version, and this modified version was cleared by the Medical University of Vienna for public use (DC 2019-0054). Variable names have been kept as they were (some are German abbreviations). A data dictionary is available in the misc folder of the project directory (‘bacteremia-DataDictionary.csv’).
In the original paper describing the study (Ratzinger et al, PLoS One 2014), a machine learning approach was taken to diagnose a positive status of blood culture. The true status was determined for all blood samples by blood culture analysis, which is the gold standard. Here we will make use of a multivariable logistic regression model.
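As a rough illustration of what a multivariable logistic regression looks like in R, the following sketch fits one to simulated data. The data frame `toy`, its coefficients and the three predictors are assumptions made for this example only. It is not the bacteremia data and not the model specified in the analysis plan.

```r
# Simulated stand-in data (hypothetical values, NOT the bacteremia dataset)
set.seed(1)
n <- 500
toy <- data.frame(
  Alter = round(runif(n, 18, 90)),         # age in years
  sex   = sample(1:2, n, replace = TRUE),  # 1 = male, 2 = female
  CRP   = rexp(n, rate = 1 / 8)            # made-up marker values
)

# Hypothetical data-generating model for the binary outcome
lp <- -3 + 0.02 * toy$Alter + 0.08 * toy$CRP
toy$BC <- rbinom(n, size = 1, prob = plogis(lp))

# Multivariable logistic regression of the outcome on all predictors
fit <- glm(BC ~ Alter + factor(sex) + CRP, data = toy, family = binomial)

# Predicted probabilities serve the prediction aim,
# odds ratios serve the explanatory aim
toy$p_hat   <- predict(fit, type = "response")
odds_ratios <- exp(coef(fit))
```

The same `glm()` call pattern would apply to the real data, with the outcome and predictors chosen according to the statistical analysis plan.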
14.2 Bacteremia dataset contents
14.2.1 Source dataset
We refer to the dataset available in this repository as the source dataset.
Display the source dataset contents. This dataset is in the data folder of the project directory.
Data frame: bact
14691 observations and 53 variables, maximum # NAs: 7114

Name | Storage | NAs |
---|---|---|
ID | integer | 0 |
sex | integer | 0 |
Alter | integer | 0 |
MCV | double | 42 |
HGB | double | 41 |
HCT | double | 42 |
PLT | integer | 42 |
MCH | double | 42 |
MCHC | double | 42 |
RDW | double | 56 |
MPV | double | 702 |
LYM | double | 262 |
MONO | double | 246 |
EOS | double | 135 |
BASO | double | 146 |
NT | integer | 2467 |
APTT | double | 2549 |
FIB | integer | 2567 |
NA. | integer | 1282 |
K | double | 2008 |
CA | double | 1276 |
PHOS | double | 1242 |
MG | double | 1869 |
KREA | double | 159 |
BUN | double | 172 |
HS | double | 3061 |
GBIL | double | 1441 |
TP | double | 1583 |
ALB | double | 1676 |
AMY | integer | 3913 |
PAMY | integer | 7114 |
LIP | integer | 3699 |
CHE | double | 2447 |
AP | integer | 1400 |
ASAT | integer | 1154 |
ALAT | integer | 987 |
GGT | integer | 1262 |
LDH | integer | 1714 |
CK | integer | 2080 |
GLU | integer | 4192 |
TRIG | integer | 5061 |
CHOL | integer | 5045 |
CRP | double | 155 |
BASOR | double | 732 |
EOSR | double | 732 |
LYMR | double | 732 |
MONOR | double | 732 |
NEU | double | 728 |
NEUR | double | 732 |
PDW | double | 1102 |
RBC | double | 461 |
WBC | double | 462 |
BloodCulture | character | 0 |
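
A listing of this kind (variable name, storage mode, number of missing values) can be approximated in base R as sketched below. The toy data frame is a hypothetical stand-in for `bact`. The session info shows Hmisc attached, and its `contents()` function gives a similar overview, but which call produced the table above is an assumption.

```r
# Toy stand-in for `bact` (hypothetical values, same kinds of columns)
toy <- data.frame(
  ID  = 1:5,
  MCV = c(88.1, NA, 91.4, 79.9, NA),
  PLT = c(210L, 145L, NA, 302L, 188L),
  BloodCulture = c("no", "yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# One row per variable: storage mode and number of missing values
overview <- data.frame(
  Name    = names(toy),
  Storage = vapply(toy, typeof, character(1)),
  NAs     = vapply(toy, function(x) sum(is.na(x)), integer(1)),
  row.names = NULL
)
overview

# Largest number of missing values in any single variable
max(overview$NAs)
```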
14.2.2 Updated analysis dataset
We add metadata (labels and units) for all variables to the source dataset, taking it from the data dictionary, and write the annotated dataset back to the data folder. The annotated dataset also contains a binary outcome variable, BC (0/1), which brings it to 54 variables.
At this stage we could also select the variables of interest to take into the IDA phase by dropping the variables we do not check in IDA.
As a cross-check, we display the contents again to confirm that the metadata has been added, and then write the changes to the data folder in the file “data/a_bact.rda”.
Input object size: 5119632 bytes; 53 variables 14691 observations
New object size: 5159904 bytes; 53 variables 14691 observations
Input object size: 5277552 bytes; 54 variables 14691 observations
New object size: 5219544 bytes; 54 variables 14691 observations
Data frame: a_bact
14691 observations and 54 variables, maximum # NAs: 7114

Name | Labels | Units | Class | Storage | NAs |
---|---|---|---|---|---|
ID | Patient Identification | 1-14691 | integer | integer | 0 |
sex | Patient Sex | 1=male, 2=female | integer | integer | 0 |
Alter | Patient Age | years | integer | integer | 0 |
MCV | Mean corpuscular volume | pg | numeric | double | 42 |
HGB | Haemoglobin | G/L | numeric | double | 41 |
HCT | Haematocrit | % | numeric | double | 42 |
PLT | Blood platelets | G/L | integer | integer | 42 |
MCH | Mean corpuscular hemoglobin | fl | numeric | double | 42 |
MCHC | Mean corpuscular hemoglobin concentration | g/dl | numeric | double | 42 |
RDW | Red blood cell distribution width | % | numeric | double | 56 |
MPV | Mean platelet volume | fl | numeric | double | 702 |
LYM | Lymphocytes | G/L | numeric | double | 262 |
MONO | Monocytes | G/L | numeric | double | 246 |
EOS | Eosinophils | G/L | numeric | double | 135 |
BASO | Basophiles | G/L | numeric | double | 146 |
NT | Normotest | % | integer | integer | 2467 |
APTT | Activated partial thromboplastin time | sec | numeric | double | 2549 |
FIB | Fibrinogen | mg/dl | integer | integer | 2567 |
NA. | Sodium | mmol/L | integer | integer | 1282 |
K | Potassium | mmol/L | numeric | double | 2008 |
CA | Calcium | mmol/L | numeric | double | 1276 |
PHOS | Phosphate | mmol/L | numeric | double | 1242 |
MG | Magnesium | mmol/L | numeric | double | 1869 |
KREA | Creatinine | mg/dl | numeric | double | 159 |
BUN | Blood urea nitrogen | mg/dl | numeric | double | 172 |
HS | Uric acid | mg/dl | numeric | double | 3061 |
GBIL | Bilirubin | mg/dl | numeric | double | 1441 |
TP | Total protein | G/L | numeric | double | 1583 |
ALB | Albumin | G/L | numeric | double | 1676 |
AMY | Amylase | U/L | integer | integer | 3913 |
PAMY | Pancreas amylase | U/L | integer | integer | 7114 |
LIP | Lipases | U/L | integer | integer | 3699 |
CHE | Cholinesterase | kU/L | numeric | double | 2447 |
AP | Alkaline phosphatase | U/L | integer | integer | 1400 |
ASAT | Aspartate transaminase | U/L | integer | integer | 1154 |
ALAT | Alanin transaminase | U/L | integer | integer | 987 |
GGT | Gamma-glutamyl transpeptidase | G/L | integer | integer | 1262 |
LDH | Lactate dehydrogenase | U/L | integer | integer | 1714 |
CK | Creatinine kinases | U/L | integer | integer | 2080 |
GLU | Glucoses | mg/dl | integer | integer | 4192 |
TRIG | Triclyceride | mg/dl | integer | integer | 5061 |
CHOL | Cholesterol | mg/dl | integer | integer | 5045 |
CRP | C-reactive protein | mg/dl | numeric | double | 155 |
BASOR | Basophile ratio | % | numeric | double | 732 |
EOSR | Eosinophil ratio | % | numeric | double | 732 |
LYMR | Lymphocyte ratio | % (mg/dl) | numeric | double | 732 |
MONOR | Monocyte ratio | % | numeric | double | 732 |
NEU | Neutrophiles | G/L | numeric | double | 728 |
NEUR | Neutrophile ratio | % | numeric | double | 732 |
PDW | Platelet distribution width | % | numeric | double | 1102 |
RBC | Red blood count | T/L | numeric | double | 461 |
WBC | White blood count | G/L | numeric | double | 462 |
BloodCulture | Blood culture result for bacteremia | no, yes | character | character | 0 |
BC | bacteremia | 0/1 | integer | integer | 0 |
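
The annotation step described in this section can be sketched in base R as follows. The toy data, the dictionary layout (`variable`, `label`, `units`), the output path, and the derivation of `BC` from `BloodCulture` are all assumptions made for illustration. The size report printed earlier resembles the output of Hmisc's `upData()`, so the original code may use that function instead of plain attributes.

```r
# Toy stand-ins (hypothetical values); the real inputs are the source
# dataset and the data dictionary described in the text
toy <- data.frame(
  Alter = c(54L, 71L, 38L),
  CRP   = c(1.2, 18.5, NA),
  BloodCulture = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)
dict <- data.frame(  # hypothetical dictionary layout
  variable = c("Alter", "CRP", "BloodCulture"),
  label    = c("Patient Age", "C-reactive protein",
               "Blood culture result for bacteremia"),
  units    = c("years", "mg/dl", "no, yes"),
  stringsAsFactors = FALSE
)

# Attach label and units to each column as plain attributes
for (i in seq_len(nrow(dict))) {
  v <- dict$variable[i]
  attr(toy[[v]], "label") <- dict$label[i]
  attr(toy[[v]], "units") <- dict$units[i]
}

# Derive a 0/1 outcome from the character result (assumed derivation)
toy$BC <- as.integer(toy$BloodCulture == "yes")
attr(toy$BC, "label") <- "bacteremia"

# Write the annotated copy, then read it back as a cross-check
out_file <- file.path(tempdir(), "a_toy.rda")  # hypothetical path
save(toy, file = out_file)
check <- new.env()
load(out_file, envir = check)
identical(attr(check$toy$CRP, "label"), "C-reactive protein")
```

Storing labels and units as attributes keeps the metadata attached to the data frame, so it survives the round trip through `save()` and `load()`.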
14.3 Section session info
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Austria.1252 LC_CTYPE=English_Austria.1252
## [3] LC_MONETARY=English_Austria.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Austria.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Hmisc_4.6-0 Formula_1.2-4 survival_3.2-13 lattice_0.20-45
## [5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.8 purrr_0.3.4
## [9] readr_2.1.2 tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5
## [13] tidyverse_1.3.1 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 sass_0.4.1 jsonlite_1.8.0
## [4] splines_4.1.3 modelr_0.1.8 bslib_0.3.1
## [7] assertthat_0.2.1 latticeExtra_0.6-29 cellranger_1.1.0
## [10] yaml_2.3.5 pillar_1.7.0 backports_1.4.1
## [13] glue_1.6.2 digest_0.6.29 checkmate_2.0.0
## [16] RColorBrewer_1.1-2 rvest_1.0.2 colorspace_2.0-3
## [19] htmltools_0.5.2 Matrix_1.4-0 pkgconfig_2.0.3
## [22] broom_0.7.12 haven_2.4.3 bookdown_0.25
## [25] scales_1.1.1 jpeg_0.1-9 tzdb_0.2.0
## [28] htmlTable_2.4.0 generics_0.1.2 ellipsis_0.3.2
## [31] withr_2.5.0 nnet_7.3-17 cli_3.2.0
## [34] magrittr_2.0.2 crayon_1.5.1 readxl_1.3.1
## [37] evaluate_0.15 fs_1.5.2 fansi_1.0.3
## [40] xml2_1.3.3 foreign_0.8-82 data.table_1.14.2
## [43] tools_4.1.3 hms_1.1.1 lifecycle_1.0.1
## [46] munsell_0.5.0 reprex_2.0.1 cluster_2.1.2
## [49] compiler_4.1.3 jquerylib_0.1.4 rlang_1.0.2
## [52] grid_4.1.3 rstudioapi_0.13 htmlwidgets_1.5.4
## [55] base64enc_0.1-3 rmarkdown_2.13 gtable_0.3.0
## [58] DBI_1.1.2 R6_2.5.1 gridExtra_2.3
## [61] lubridate_1.8.0 knitr_1.38 fastmap_1.1.0
## [64] utf8_1.2.2 rprojroot_2.0.2 stringi_1.7.6
## [67] Rcpp_1.0.8.3 vctrs_0.3.8 rpart_4.1.16
## [70] png_0.1-7 dbplyr_2.1.1 tidyselect_1.1.2
## [73] xfun_0.30