3 Data screening and possible actions
3.1 Univariate distributions
Variable type | What to look at | Possible actions: Interpretation | Possible actions: SAP (statistical analysis plan) | Possible actions: Presentation |
---|---|---|---|---|
Continuous variables | General skewness | Help in interpreting results | Update SAP | Update intended presentation of results |
Continuous variables | General skewness | Wide CI for coefficients | Use variable as log-transformed | Update intended presentation of results |
Continuous variables | Outliers | Disproportionate impact on results | Winsorize or transform | Model involves winsorization |
Continuous variables | Spike at 0 | Narrow CI at 0 | Use appropriate representation of variable in model | Use 2 (or more) coefficients to distinguish 0 from non-0 continuous part |
Categorical variables | Frequencies | Comparisons to default reference probably irrelevant | Change reference category | Contrasts compare to (new) reference category |
Categorical variables | Rare categories | Wide CI for coefficients | Collapse/exclude | Fewer categories to present |
Categorical variables | One very frequent category | Comparisons irrelevant? | Exclude variable | Variable omitted |
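
The univariate checks in the table above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a prescribed implementation: the function names, the nearest-rank quantile convention, the 5%/95% winsorization limits, and the minimum count of 5 for a "rare" category are all assumptions made for the example.

```python
import math
from collections import Counter


def skewness(values):
    """Moment-based skewness; a value > 0 indicates a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def winsorize(values, lower=0.05, upper=0.95):
    """Clamp values to empirical quantiles (nearest-rank convention, assumed)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[max(0, math.ceil(lower * n) - 1)]
    hi = ordered[min(n - 1, math.ceil(upper * n) - 1)]
    return [min(max(v, lo), hi) for v in values]


def spike_at_zero_terms(value):
    """Represent a variable with a spike at 0 by two model terms:
    an indicator for non-zero, and the log of the positive part (0.0 at zero)."""
    nonzero = 1 if value > 0 else 0
    return nonzero, (math.log(value) if value > 0 else 0.0)


def collapse_rare(categories, min_count=5, other="Other"):
    """Merge categories observed fewer than min_count times into one level."""
    counts = Counter(categories)
    return [c if counts[c] >= min_count else other for c in categories]
```

Comparing `skewness(x)` with `skewness([math.log(v) for v in x])` for a strictly positive variable `x` gives a quick indication of whether a log transformation would be worth writing into the SAP.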
3.2 Bivariate distributions
Variable pair | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
---|---|---|---|---|
Continuous by continuous | Outliers (from the cloud) | Disproportionate impact on results | Winsorize or transform | Model involves winsorization |
Continuous by continuous | Correlations | Wide CI for coefficients | Winsorize or transform | Model involves winsorization |
Continuous by categorical | Outliers (only visible in bivariate plot) | Wide CI for coefficients | | |
Categorical by categorical | Frequent/rare combinations | Comparison to default reference irrelevant | Change reference category | Contrasts compare to (new) reference category |
Categorical by categorical | Frequent/rare combinations | Interactions relevant? | Remove interaction from model | Fewer interactions to present |
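
Some of the bivariate checks above can be illustrated as follows. This is a rough sketch under stated assumptions: the helper names and the `min_count=5` threshold for a sparse cell are hypothetical, and choosing the most frequent category as the reference is only one possible convention.

```python
import math
from collections import Counter
from itertools import product


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sparse_cells(a, b, min_count=5):
    """Cross-tabulate two categorical variables and return the combinations
    (including empty ones) observed fewer than min_count times."""
    counts = Counter(zip(a, b))
    levels_a, levels_b = sorted(set(a)), sorted(set(b))
    return {
        (u, v): counts[(u, v)]
        for u, v in product(levels_a, levels_b)
        if counts[(u, v)] < min_count
    }


def most_frequent(categories):
    """Candidate reference category: the most frequently observed level."""
    return Counter(categories).most_common(1)[0][0]
```

Combinations returned by `sparse_cells` are those for which an interaction term would be estimated from very few observations, which is the situation in which the table suggests removing the interaction from the model.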
3.3 Missing values
Scope | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
---|---|---|---|---|
Per variable | Number and proportion | Wide CI for coefficients | Remove variable if many missing values | |
Pattern | Variables missing independently or together | | Omit variables together | Changes model |
Pattern | Variables missing dependent on levels of other variables | Systematic missingness? Model still based on a representative sample? | Inverse probability weighting (IPW) needed? | Weighted analysis |
Complete cases | Number and proportion | Few cases left for main CCO analysis | Multiple imputation (or other way of dealing with missing values)? | Result from MI analysis? Or applicability restricted to a subpopulation? |
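
The missing-value summaries in the table above (per variable, by pattern, and complete cases) could be computed along these lines. The sketch assumes the data are a list of dictionaries with `None` marking a missing value; that layout and the function names are assumptions made for the example, and the sketch does not cover IPW or multiple imputation.

```python
from collections import Counter


def missing_per_variable(rows, variables):
    """Number and proportion of missing values for each variable."""
    n = len(rows)
    result = {}
    for v in variables:
        n_missing = sum(1 for r in rows if r.get(v) is None)
        result[v] = (n_missing, n_missing / n)
    return result


def missing_patterns(rows, variables):
    """Count rows by the tuple of variables that are missing together.
    The empty tuple () stands for rows with no missing values."""
    return Counter(
        tuple(v for v in variables if r.get(v) is None) for r in rows
    )


def complete_cases(rows, variables):
    """Number and proportion of rows with all listed variables observed."""
    n_complete = sum(
        1 for r in rows if all(r.get(v) is not None for v in variables)
    )
    return n_complete, n_complete / len(rows)
```

A pattern such as `("bmi", "sbp")` occurring often in the output of `missing_patterns` would suggest that those variables are missing together, which is the case where the table proposes omitting them jointly.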
[…]