Chapter 3 Data screening and possible actions

3.1 Univariate distributions

| | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Continuous variables | General skewness | Help in interpreting results | Update SAP | Update intended presentation of results |
| Continuous variables | General skewness | Wide CI for coefficients | Use variable as log-transformed | Update intended presentation of results |
| Continuous variables | Outliers | Disproportional impact on results | Winsorize or transform | Model involves winsorization |
| Continuous variables | Spike at 0 | Narrow CI at 0 | Use appropriate representation of variable in model | Use 2 (or more) coefficients to distinguish 0 from non-0 continuous part |
| Categorical variables | Frequencies | Comparisons to default reference probably irrelevant | Change reference category | Contrasts compare to (new) reference category |
| Categorical variables | Rare categories | Wide CI for coefficients | Collapse/exclude | Fewer categories to present |
| Categorical variables | One very frequent category | Comparisons irrelevant? | Exclude variable | Variable omitted |
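
Some of the checks and actions in the table above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library. The function names, the 10%/90% winsorization limits, the nearest-rank quantile rule, the choice of the most frequent level as the new reference category, and the example data are all assumptions made for the sketch, not recommendations from the text.

```python
import math
from collections import Counter


def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def winsorize(values, lower_q=0.1, upper_q=0.9):
    """Clamp values to empirical quantiles (simple nearest-rank rule, an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(round(lower_q * (n - 1)))]
    hi = ordered[int(round(upper_q * (n - 1)))]
    return [min(max(v, lo), hi) for v in values]


def spike_at_zero_terms(values):
    """Two model terms per observation: a non-zero indicator and the value itself."""
    return [(1 if v != 0 else 0, v) for v in values]


def collapse_rare(categories, min_count, other_label="Other"):
    """Merge every category observed fewer than min_count times into one label."""
    counts = Counter(categories)
    return [c if counts[c] >= min_count else other_label for c in categories]


def most_frequent(categories):
    """Candidate for a new reference category: the most frequent level."""
    return Counter(categories).most_common(1)[0][0]


# Hypothetical continuous variable with a spike at 0 and one extreme value.
x = [0, 0, 1, 2, 3, 4, 5, 6, 7, 100]
print(skewness(x) > 0)              # True: long right tail
print(winsorize(x))                 # [0, 0, 1, 2, 3, 4, 5, 6, 7, 7]
print(spike_at_zero_terms(x)[:3])   # [(0, 0), (0, 0), (1, 1)]
log_positive = [math.log(v) for v in x if v > 0]  # log-transform of the non-zero part

# Hypothetical categorical variable with one rare level.
cats = ["A"] * 6 + ["B"] * 3 + ["C"]
print(most_frequent(cats))          # A
print(collapse_rare(cats, min_count=2))
```

Whether to winsorize, transform, or recode is a decision for the SAP. The sketch only shows what each option does to the data.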

3.2 Bivariate distributions

| | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Continuous by continuous | Outliers (from the cloud) | Disproportional impact on results | Winsorize or transform | Model involves winsorization |
| Continuous by continuous | Correlations | Wide CI for coefficients | Winsorize or transform | Model involves winsorization |
| Continuous by categorical | Outliers (only visible in bivariate plot) | Wide CI for coefficients | | |
| Categorical by categorical | Frequent/rare combinations | Comparison to default reference irrelevant | Change reference category | Contrasts compare to (new) reference category |
| Categorical by categorical | Frequent/rare combinations | Interactions relevant? | Remove interaction from model | Fewer interactions to present |
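
As a rough illustration of two of the bivariate checks above, the following standard-library Python sketch computes a Pearson correlation for a pair of continuous variables and lists rare or empty combinations of two categorical variables. The function names, the `min_count` threshold, and the example variables (`sex`, `smoker`) are hypothetical and chosen only for this sketch.

```python
import math
from collections import Counter
from itertools import product


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rare_combinations(a, b, min_count):
    """Cross-tabulate a by b; return combinations seen fewer than min_count times.

    Combinations that never occur are included with a count of 0.
    """
    counts = Counter(zip(a, b))
    levels = product(sorted(set(a)), sorted(set(b)))
    return {combo: counts[combo] for combo in levels if counts[combo] < min_count}


# Hypothetical, nearly collinear continuous variables.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
print(round(pearson_r(x, y), 3))

# Hypothetical categorical variables.
sex = ["f", "f", "f", "m", "m", "m"]
smoker = ["no", "no", "yes", "no", "no", "no"]
print(rare_combinations(sex, smoker, min_count=2))
# {('f', 'yes'): 1, ('m', 'yes'): 0}
```

A high correlation or a sparse cell is only a prompt to revisit the model. What counts as "high" or "rare" depends on the study and is not fixed by this sketch.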

3.3 Missing values

| | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Per variable | Number and proportion | Wide CI for coefficients | Remove variable if many missing values | |
| Pattern | Variables missing independently or together | | Omit variables together | Changes model |
| Pattern | Variables missing dependent on levels of other variables | Systematic missingness? Model still based on representative sample? | IPW needed? | Weighted analysis |
| Complete cases | Number and proportion | Few cases left for main CCO analysis | Multiple imputation (or other way of dealing with missing values)? | Result from MI analysis? Or applicability restricted to a subpopulation? |
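
The per-variable, pattern, and complete-case summaries listed above could be computed along the following lines. This is a minimal standard-library Python sketch. It assumes the data are held as a list of dictionaries with `None` marking a missing value, and the function names and example variables (`age`, `bmi`, `income`) are hypothetical.

```python
from collections import Counter


def missing_summary(rows, variables):
    """Per variable: (number missing, proportion missing)."""
    n = len(rows)
    summary = {}
    for var in variables:
        k = sum(1 for row in rows if row.get(var) is None)
        summary[var] = (k, k / n)
    return summary


def missing_patterns(rows, variables):
    """Count how often each set of variables is missing together."""
    return Counter(
        tuple(var for var in variables if row.get(var) is None) for row in rows
    )


def complete_cases(rows, variables):
    """Rows with no missing value in any of the listed variables."""
    return [row for row in rows if all(row.get(var) is not None for var in variables)]


# Hypothetical data set; None marks a missing value.
variables = ["age", "bmi", "income"]
rows = [
    {"age": 54, "bmi": 27.1, "income": 3},
    {"age": 61, "bmi": None, "income": None},
    {"age": None, "bmi": 24.3, "income": 2},
    {"age": 47, "bmi": None, "income": None},
]

print(missing_summary(rows, variables))
# {'age': (1, 0.25), 'bmi': (2, 0.5), 'income': (2, 0.5)}
print(missing_patterns(rows, variables))   # bmi and income are missing together twice
cc = complete_cases(rows, variables)
print(len(cc), len(cc) / len(rows))        # 1 0.25
```

In this toy example `bmi` and `income` are always missing together, which is the kind of pattern that might lead to omitting both variables together. Whether missingness depends on the levels of other variables needs a separate check that is not shown here.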

References

Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.

Harrell FE. Regression Modeling Strategies. 2nd ed. Springer 2015.

[…]