5 Univariate distribution checks

This section reports a series of univariate summary checks of the bacteremia dataset.

5.1 U1: Categorical variables

Age group, sex and bacteremia status are described by frequencies and proportions in each category.

Category Count Proportion
Age group
[16, 50] 5365 0.37
(65, 101] 5076 0.35
(50, 65] 4250 0.29
Sex
male 8536 0.58
female 6155 0.42
Presence of bacteremia
no 13511 0.92
yes 1180 0.08
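The proportions in the table can be recomputed from the reported counts. The following is a minimal sketch in Python rather than the code used to produce the report; the dictionary layout and the helper name `proportions` are illustrative assumptions, while the counts are copied from the table above.

```python
# Counts copied from the table above; the nesting is an illustrative choice.
counts = {
    "Age group": {"[16, 50]": 5365, "(65, 101]": 5076, "(50, 65]": 4250},
    "Sex": {"male": 8536, "female": 6155},
    "Presence of bacteremia": {"no": 13511, "yes": 1180},
}


def proportions(category_counts):
    """Return the share of each category, rounded to two decimals."""
    total = sum(category_counts.values())
    return {level: round(n / total, 2) for level, n in category_counts.items()}


for variable, levels in counts.items():
    # Each variable should account for all observations.
    print(variable, "total:", sum(levels.values()))
    for level, share in proportions(levels).items():
        print(f"  {level} {levels[level]} {share:.2f}")
```

Because each proportion is rounded separately, the rounded shares of a variable need not sum to exactly 1 (for age group they sum to 1.01).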

The categories are also shown as simple bar charts.

Summary of categorical variables including outcome

5.2 Continuous variables

5.2.1 U2: Univariate distributions of continuous variables

5.2.1.1 U2: Structural variables

AGE is the only continuous structural variable. Because it is also a key predictor, it is reported together with the key predictors in the following section.

5.2.1.2 U2: Key predictors

Distribution of key predictors. Reference lines mark the five-number summary (minimum, lower quartile, median, upper quartile, maximum) and are annotated with the numerical values where possible.
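For readers who want to reproduce the reference lines, a five-number summary could be computed as in the following sketch. It is an illustration only: the helper name `five_number_summary` and the example values are hypothetical, and the quantile rule (linear interpolation, Python's "inclusive" method) is an assumption that may differ from the rule used for the plots.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    # n=4 yields the three quartile cut points; "inclusive" treats the
    # data as containing the population minimum and maximum.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Hypothetical measurements, not taken from the bacteremia data.
example = [7.3, 4.6, 10.8, 1.6, 18.4, 2.7, 15.1]
print(five_number_summary(example))
```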

The remaining predictors are reported in Appendix Section E.1.1.

5.2.1.3 U2: Predictors of medium importance

5.2.2 Numerical summaries

5.2.2.1 Key predictors

key_predictors Descriptives

6 Variables   14691 Observations

PLT: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
14649 42 718 1 220 130.1 50 81 140 204 277 369 445
lowest : 0 1 2 3 4 , highest: 1068 1211 1321 1639 2092
CREA: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
14532 159 674 1 1.329 0.8518 0.620 0.690 0.810 1.000 1.350 2.160 3.144
lowest : 0.26 0.27 0.28 0.29 0.3 , highest: 15.24 15.4 15.67 16.64 20.75
BUN: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
14519 172 947 1 22.66 16.92 7.1 8.6 11.6 16.6 26.9 44.8 60.8
lowest : 2.5 2.7 2.8 2.9 3 , highest: 160.6 171.3 171.9 176 184.8
NEU: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
13963 728 374 1 8.367 5.776 1.60 2.70 4.60 7.30 10.80 15.08 18.40
lowest : 0 0.1 0.2 0.3 0.4 , highest: 54 56.4 63.7 71.6 83.8
WBC: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
14229 462 2710 1 11.23 7.602 2.66 4.26 6.63 9.60 13.53 18.22 22.27
lowest : 0 0.01 0.02 0.03 0.04 , highest: 365.3 383.74 387.73 433.83 604.47
AGE: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
14691 0 85 1 56.17 20.78 24 29 43 58 70 79 84
lowest : 16 17 18 19 20 , highest: 96 97 98 99 101
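The Gmd column is not explained in the report. Its layout suggests Gini's mean difference, the mean absolute difference between all pairs of distinct observations, which is a robust measure of dispersion; this reading is an assumption, since the report does not name the tool that produced the summaries. Under that assumption, the statistic could be computed as in the following sketch, where the function names and the example values are hypothetical.

```python
from itertools import combinations


def gini_mean_difference(values):
    """Mean absolute difference over all pairs of distinct observations.

    Uses the order-statistic identity so that only a sort is needed
    instead of a loop over all pairs.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are required")
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(data, start=1))
    return 2.0 * weighted / (n * (n - 1))


def gini_mean_difference_pairs(values):
    """Direct pairwise definition, quadratic in the number of observations."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Hypothetical ages, not taken from the bacteremia data.
ages = [24, 29, 43, 58, 70, 79, 84]
print(gini_mean_difference(ages), gini_mean_difference_pairs(ages))
```

The sorted version is preferable for a data set of this size, because the pairwise definition would have to visit roughly 10^8 pairs for 14691 observations.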

5.2.2.2 Predictors of medium importance

medium_predictors Descriptives

6 Variables   14691 Observations

FIB: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
12124 2567 1084 1 547.4 231 247 301 397 529 674 816 892
lowest : 55 60 66 67 69 , highest: 1506 1508 1529 1537 1593
POTASS: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
12683 2008 408 1 4.003 0.6004 3.20 3.39 3.66 3.95 4.29 4.67 4.92
lowest : 1.92 2.07 2.11 2.12 2.21 , highest: 8.57 11.34 13.55 14.6 36.62
ASAT: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
13537 1154 650 1 86.9 115.6 15 17 22 31 56 121 218
lowest : 3 5 6 7 8 , highest: 10845 11928 12079 12380 13991
ALAT: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
13704 987 578 1 67.66 90.07 9 11 16 26 48 101 175
lowest : 0 1 2 3 4 , highest: 7109 9136 9314 12329 15059
GGT: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
13429 1262 858 1 115.1 141.3 13.0 16.0 25.0 49.0 117.0 262.2 429.0
lowest : 3 5 6 7 8 , highest: 2932 3303 3782 3919 5171
CRP: Parameter analysis value (Numeric)
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
14536 155 3328 1 10.92 10.39 0.29 0.77 2.87 8.57 16.45 24.49 29.61
lowest : 0 0.01 0.02 0.03 0.04 , highest: 58.4 61.81 63.96 73.04 76.32

5.2.3 Suggested transformations

Next we investigate whether a pseudo-log transformation can substantially symmetrize the univariate distributions of the continuous variables and hence be useful for multivariate summaries. For this purpose we use the function ida_trans, which optimizes the parameter sigma of the pseudo-logarithm so that the transformed values have the highest possible linear correlation with normal deviates. If no better transformation can be found, or if the improvement in correlation is less than 0.2 correlation units, no transformation is suggested.
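The selection rule can be illustrated with a short sketch. It is not the implementation of ida_trans, whose internals are not shown here. Several points are assumptions: the pseudo-logarithm is taken to be asinh(x / (2 sigma)) / log(base), one common definition; "correlation with normal deviates" is read as the Pearson correlation between the sorted values and normal scores; a simple grid search stands in for whatever optimizer is actually used; and all function names are hypothetical. Only the 0.2 threshold is taken from the text.

```python
import math
from statistics import NormalDist


def pseudo_log(x, sigma=1.0, base=10.0):
    """Assumed pseudo-logarithm: near-linear around 0, logarithmic for large |x|."""
    return math.asinh(x / (2.0 * sigma)) / math.log(base)


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def normal_scores(n):
    """Expected standard normal quantiles for n ordered observations."""
    dist = NormalDist()
    return [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def normal_correlation(values):
    """Correlation of the sorted values with normal scores."""
    return pearson(sorted(values), normal_scores(len(values)))


def suggest_sigma(values, sigmas, min_gain=0.2):
    """Return (sigma or None, correlation before, best correlation after)."""
    before = normal_correlation(values)
    best_sigma, best = None, before
    for sigma in sigmas:  # grid search as a stand-in for an optimizer
        after = normal_correlation([pseudo_log(v, sigma) for v in values])
        if after > best:
            best_sigma, best = sigma, after
    if best_sigma is None or best - before < min_gain:
        return None, before, best  # no transformation suggested
    return best_sigma, before, best


# Hypothetical, strongly right-skewed values (not from the bacteremia data).
skewed = [math.exp(3.0 * z) for z in normal_scores(200)]
print(suggest_sigma(skewed, [1e-5, 1e-3, 1e-1, 1.0, 10.0]))
```

With a threshold this large, only markedly skewed variables receive a suggestion; a variable that is already roughly symmetric returns None because its correlation with normal scores is close to 1 before any transformation.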

The proposed variable transformations and their new parameter codes are listed below.

PARAMCD n
ALAT_T 14691
AMY_T 14691
AP_T 14691
ASAT_T 14691
BASO_T 14691
CK_T 14691
CREA_T 14691
EOS_T 14691
GBIL_T 14691
GGT_T 14691
LDH_T 14691
LIP_T 14691
LYM_T 14691
PAMY_T 14691
WBC_T 14691

The transformed variables are registered in the data set. The updated data set, which includes the suggested pseudo-log transformed variables, is saved at data/IDA/ADLB_02.rds.

The IDA analysis plan and specifications are updated with the proposed variable transformations. A new flag is derived that indicates the categorization of predictors, now including the transformed variables.

5.2.4 Comparison of univariate distributions with and without pseudo-log transformation

The comparison is shown only for variables for which a transformation is suggested. All observed values are displayed, with the minimum, maximum and interquartile range of each distribution marked as reference lines.