Glossary
Initial data analysis framework
Initial Data Analysis
The main aim of initial data analysis (IDA) is to provide reliable knowledge about the data to enable responsible statistical analyses and interpretation. IDA has the following phases: (1) metadata setup; (2) data cleaning; (3) data screening; (4) initial data reporting; (5) refining and updating the research analysis plan; and (6) documenting and reporting IDA. IDA is aligned with the research aims and the statistical analysis plan and does not include hypothesis-generating activities.
Source: Schmidt CO, Vach W, le Cessie S, Huebner M. STRATOS: Introducing the Initial Data Analysis Topic Group (TG3). Biometric Bulletin 2018; 35 (2): 10-11
Exploratory Data Analysis
Exploratory Data Analysis (EDA) inspects data to uncover patterns or to identify possible errors and anomalies, for example through graphical approaches. EDA may include investigations for model building and hypothesis generation in a cyclical manner.
Source: Lusa L, Proust-Lima C, Schmidt CO, Lee KJ, le Cessie S, Baillie M, Lawrence F, Huebner M; TG3 of the STRATOS Initiative. Initial data analysis for longitudinal studies to build a solid foundation for reproducible analysis. PLoS One. 2024 May 29;19(5):e0295726. doi: 10.1371/journal.pone.0295726.
References:
Behrens JT. Principles and procedures of exploratory data analysis. Psychological Methods. 1997;2(2):131. https://core.ac.uk/download/pdf/193648223.pdf
Cook D, Swayne DF. Interactive and Dynamic Graphics for Data Analysis: With R and Ggobi. New York, NY: Springer New York; 2007. https://en.wikipedia.org/wiki/Exploratory_data_analysis
Statistical Analysis Plan (SAP)
A statistical analysis plan is a document that contains a more technical and detailed elaboration of the principal features of the analysis described in the protocol and includes detailed procedures for executing the statistical analysis of the primary and secondary variables and other data.
Source: International Council for Harmonisation (ICH). ICH E9: Statistical principles for clinical trials - Step 5. Glossary, p. 37
Initial Data Analysis Plan (IDAP)
An IDA plan focuses on data screening to prepare for statistical analyses; it does not specify procedures or software packages to conduct the analyses. It can be understood as a minimum basic set of analyses that can be extended depending on the context of the research study and data collection. IDA domains include missing values, univariate descriptions, and multivariate descriptions.
Source: [Regression without regrets paper - to appear]; and Lusa et al., doi: 10.1371/journal.pone.0295726.
Elements of initial data analysis
Data wrangling
Data wrangling is a broad term referring to the processes involved in preparing data for analysis. It can include acquiring and enriching data, changing the format and shape of the data, combining, subsetting, and sampling data, and cleaning data. Some common steps in data wrangling are:
- Discovering and gathering the data needed
- Merging data from different sources, if necessary
- Fixing flaws in the data entries
- Extracting the necessary data and putting it in the proper structure
- Storing it in the proper format for further use
Source: National Library of Medicine
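The wrangling steps above can be illustrated with a minimal sketch. It uses only the Python standard library, and everything in it is an assumption made for illustration: the two small data sources, the variable names (`id`, `sex`, `age`, `sbp`), and the helper `wrangle` are hypothetical and do not come from the cited source.

```python
import csv
import io

# Hypothetical example data: two sources keyed by a participant id.
visits = [
    {"id": "1", "visit_date": "2024-01-15", "sbp": "128"},
    {"id": "2", "visit_date": "2024-01-16", "sbp": " 141 "},
    {"id": "3", "visit_date": "2024-01-17", "sbp": ""},
]
demographics = [
    {"id": "1", "sex": "F", "age": "54"},
    {"id": "2", "sex": "M", "age": "61"},
]


def wrangle(visits, demographics):
    """Merge two sources on id, fix entry flaws, and keep the needed columns."""
    demo_by_id = {row["id"]: row for row in demographics}
    analysis_rows = []
    for visit in visits:
        # Left join: every visit is kept, even without a demographics match.
        demo = demo_by_id.get(visit["id"], {})
        # Fix a flaw in the entries: stray whitespace around numbers.
        sbp_text = visit["sbp"].strip()
        analysis_rows.append({
            "id": int(visit["id"]),
            "sex": demo.get("sex"),  # None when there is no match
            "age": int(demo["age"]) if "age" in demo else None,
            "sbp": float(sbp_text) if sbp_text else None,
        })
    return analysis_rows


rows = wrangle(visits, demographics)

# Store the result in a format suited for further use (CSV, held in memory here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "sex", "age", "sbp"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Keeping unmatched records instead of silently dropping them is a deliberate choice in this sketch, so that the loss of information stays visible in later IDA steps.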
Data cleaning
Data cleaning is the process of identifying and correcting data that are inaccurate, missing, or incomplete. Data cleaning tasks can include removing duplicate records, investigating extreme values (e.g., outliers), converting dates from one format to another, removing unwanted text, splitting multiple data points in a cell into separate cells, or coding missing or NA values.
Source: National Library of Medicine
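A few of the cleaning tasks listed above (removing duplicate records, converting dates, coding missing values, and investigating extreme values) could look like the following sketch. The records, the set of missing-value codes, and the plausibility range for body weight are assumptions chosen for illustration, and `clean` is a hypothetical helper.

```python
from datetime import datetime

# Assumed codes that stand for "missing" in the raw data.
MISSING_CODES = {"", "NA", "N/A", "-99"}

# Hypothetical raw records with a duplicate, a missing code, and an extreme value.
raw = [
    {"id": "1", "date": "15/01/2024", "weight_kg": "72.5"},
    {"id": "1", "date": "15/01/2024", "weight_kg": "72.5"},
    {"id": "2", "date": "16/01/2024", "weight_kg": "NA"},
    {"id": "3", "date": "17/01/2024", "weight_kg": "725"},
]


def clean(records, plausible=(20.0, 300.0)):
    """Drop exact duplicates, convert dates to ISO format, code missing
    values as None, and flag (not alter) values outside a plausible range."""
    seen = set()
    cleaned = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        text = rec["weight_kg"].strip()
        weight = None if text in MISSING_CODES else float(text)
        low, high = plausible
        cleaned.append({
            "id": rec["id"],
            "date": datetime.strptime(rec["date"], "%d/%m/%Y").date().isoformat(),
            "weight_kg": weight,
            "weight_flag": weight is not None and not (low <= weight <= high),
        })
    return cleaned


cleaned = clean(raw)
```

The extreme value is flagged for investigation and not overwritten, since whether it is an entry error or a true observation usually cannot be decided from the data alone.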
Data screening
Data screening consists of reviewing and documenting the properties and quality of the data that may affect future analysis and interpretation.
Source: Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.
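As a rough illustration of data screening, the sketch below documents two basic data properties per variable: the number of missing values and a simple univariate summary. The data set, the variable names, and the helper `screen` are hypothetical; a real screening report would typically cover more properties and would follow the IDA plan.

```python
import statistics

# Hypothetical analysis data; None marks a missing value.
data = [
    {"age": 54, "sbp": 128.0, "sex": "F"},
    {"age": 61, "sbp": 141.0, "sex": "M"},
    {"age": None, "sbp": 135.0, "sex": "F"},
    {"age": 47, "sbp": None, "sex": None},
]


def screen(records, numeric_vars):
    """Report missing-value counts for every variable and
    min/median/max for the variables named in numeric_vars."""
    report = {}
    for var in records[0]:
        values = [rec[var] for rec in records]
        observed = [v for v in values if v is not None]
        entry = {"n": len(values), "n_missing": len(values) - len(observed)}
        if var in numeric_vars and observed:
            entry.update(
                min=min(observed),
                median=statistics.median(observed),
                max=max(observed),
            )
        report[var] = entry
    return report


report = screen(data, numeric_vars={"age", "sbp"})
```

The function only describes the data and changes nothing, which matches the distinction between screening and cleaning made in this glossary.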
Structural variables
Structural variables are variables that help to structure initial data analysis results for a clear organization and essential overview of data properties. Structuring can be based on levels of measurement (centers), on calendar time of recruitment, on demographic variables such as sex or age, or on variables of central importance to the research questions. Structural variables may or may not be included as predictors in the statistical models.
Source: [Regression without regrets paper - to appear]
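One way to picture the role of a structural variable is to repeat a screening summary within each of its levels. In this sketch the structural variable is a hypothetical `center`, the summarized variable is a hypothetical `age`, and `summarize_by` is an illustrative helper, not a function from any IDA software.

```python
from collections import defaultdict

# Hypothetical records from two study centers; None marks a missing value.
records = [
    {"center": "A", "age": 54},
    {"center": "A", "age": None},
    {"center": "B", "age": 61},
    {"center": "B", "age": 47},
]


def summarize_by(records, structural_var, target_var):
    """Summarize target_var separately within each level of structural_var."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[structural_var]].append(rec[target_var])
    summary = {}
    for level in sorted(groups):
        values = groups[level]
        observed = [v for v in values if v is not None]
        summary[level] = {
            "n": len(values),
            "n_missing": len(values) - len(observed),
            "mean": sum(observed) / len(observed) if observed else None,
        }
    return summary


by_center = summarize_by(records, "center", "age")
```

Here the overall missingness of `age` is one in four, but the stratified view shows that it occurs only in center A, which is the kind of pattern a structural variable is meant to reveal.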