Glossary

Initial data analysis framework

Initial Data Analysis

The main aim of initial data analysis (IDA) is to provide reliable knowledge about the data to enable responsible statistical analyses and interpretation. IDA has the following phases: (1) metadata setup; (2) data cleaning; (3) data screening; (4) initial data reporting; (5) refining and updating the research analysis plan; and (6) documenting and reporting IDA. IDA is aligned with the research aims and the statistical analysis plan and does not include hypothesis-generating activities.

Source: Schmidt CO, Vach W, le Cessie S, Huebner M. STRATOS: Introducing the Initial Data Analysis Topic Group (TG3). Biometric Bulletin 2018; 35 (2): 10-11

Exploratory Data Analysis

Exploratory Data Analysis (EDA) inspects data to uncover patterns or to identify possible errors and anomalies, for example through graphical approaches. EDA may include investigations for model building and hypothesis generation in a cyclical manner.

Source: Lusa L, Proust-Lima C, Schmidt CO, Lee KJ, le Cessie S, Baillie M, Lawrence F, Huebner M; TG3 of the STRATOS Initiative. Initial data analysis for longitudinal studies to build a solid foundation for reproducible analysis. PLoS One. 2024 May 29;19(5):e0295726. doi: 10.1371/journal.pone.0295726.

References:

Behrens JT. Principles and procedures of exploratory data analysis. Psychological Methods. 1997;2(2):131. https://core.ac.uk/download/pdf/193648223.pdf

Cook D, Swayne DF. Interactive and Dynamic Graphics for Data Analysis: With R and Ggobi. New York, NY: Springer New York; 2007. https://en.wikipedia.org/wiki/Exploratory_data_analysis

Statistical Analysis Plan (SAP)

A statistical analysis plan is a document that contains a more technical and detailed elaboration of the principal features of the analysis described in the protocol and includes detailed procedures for executing the statistical analysis of the primary and secondary variables and other data.

Source: International Council for Harmonisation (ICH E9). ICH: E 9: Statistical principles for clinical trials - Step 5. Glossary p. 37

Initial Data Analysis Plan (IDAP)

An IDA plan focuses on data screening to prepare for statistical analyses; it does not specify procedures or software packages to conduct the analyses. It can be understood as a minimum basic set of analyses that can be extended depending on the context of the research study and data collection. IDA domains include missing values, univariable descriptions, and multivariable descriptions.

Source: [Heinze et al https://doi.org/10.1186/s12874-024-02294-3; Lusa et al https://doi.org/10.1371/journal.pone.0295726]
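
As an illustration only, the three IDA domains named above (missing values, univariable descriptions, multivariable descriptions) could be sketched as follows. The toy records, the variable names `age` and `sbp`, and the helper functions are assumptions made for this example; they are not taken from the cited sources, and a real IDA plan would state which summaries are needed for which variables.

```python
from math import sqrt
from statistics import mean, median

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "sbp": 118.0},
    {"age": 51, "sbp": None},
    {"age": 47, "sbp": 131.0},
    {"age": None, "sbp": 125.0},
    {"age": 62, "sbp": 140.0},
]

def missing_fraction(rows, var):
    """Missing values domain: share of records in which `var` is missing."""
    return sum(r[var] is None for r in rows) / len(rows)

def univariable_summary(rows, var):
    """Univariable domain: location and range of the observed values."""
    values = [r[var] for r in rows if r[var] is not None]
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }

def pearson(rows, x, y):
    """Multivariable domain: Pearson correlation over complete pairs.

    Assumes both variables vary; a constant variable would divide by zero.
    """
    pairs = [(r[x], r[y]) for r in rows if r[x] is not None and r[y] is not None]
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

for var in ("age", "sbp"):
    print(var, missing_fraction(records, var), univariable_summary(records, var))
print("age vs sbp:", round(pearson(records, "age", "sbp"), 3))
```

The sketch only describes the data; in keeping with the definition above, it makes no modelling decisions.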

Elements of initial data analysis

Data wrangling

Data wrangling is a broad term referring to the processes involved when preparing data for analysis. It can include acquiring data, enriching, changing the format and shape of the data, combining, subsetting and sampling data, and cleaning data. Some common steps involved in data wrangling are:

  • Discovering and gathering the data needed
  • Merging data from different sources, if necessary
  • Fixing flaws in the data entries
  • Extracting the necessary data and putting it in the proper structure
  • Storing it in the proper format for further use

Source: National Library of Medicine
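
The wrangling steps listed above might look like the following minimal sketch. The two in-memory CSV strings, the column names, and the `to_float` helper are hypothetical stand-ins for real data sources; only the Python standard library is used.

```python
import csv
import io

# Hypothetical raw inputs standing in for two separate data sources.
visits_csv = (
    "id,visit_date,weight_kg\n"
    "1,2024-01-05,70.5\n"
    "2,2024-01-07, 82.0 \n"
    "3,2024-01-09,\n"
)
sites_csv = "id,site\n1,A\n2,B\n3,A\n"

# Gather: read both sources.
visits = list(csv.DictReader(io.StringIO(visits_csv)))
sites = {row["id"]: row["site"] for row in csv.DictReader(io.StringIO(sites_csv))}

# Merge: attach the site to each visit via the shared id.
merged = [{**visit, "site": sites.get(visit["id"])} for visit in visits]

def to_float(text):
    """Fix entry flaws: strip stray whitespace, map empty cells to None."""
    text = text.strip()
    return float(text) if text else None

# Extract only the columns needed and put them in the analysis structure.
analysis_rows = [
    {"id": int(r["id"]), "site": r["site"], "weight_kg": to_float(r["weight_kg"])}
    for r in merged
]

# Store: write the result as CSV (to an in-memory buffer in this sketch).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "site", "weight_kg"])
writer.writeheader()
writer.writerows(analysis_rows)
stored_csv = out.getvalue()
print(stored_csv)
```

In practice the final step would write to a file or database, and each transformation would be recorded so that the prepared data set can be reproduced.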

Data cleaning

Data cleaning is the process of identifying and correcting data that are inaccurate, missing, or incomplete. Data cleaning tasks can include removing duplicate records, investigating extreme values (e.g., outliers), converting dates from one format to another, removing unwanted text, splitting multiple data points in a cell into separate cells, or coding missing or NA values.

Source: National Library of Medicine
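
Several of the cleaning tasks named above (removing duplicate records, converting dates, removing unwanted text, splitting a cell, coding missing values) can be sketched as below. The records, the `DD/MM/YYYY` input format, the `bp` cell layout, and the set of missing-value codes are all assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical raw records as they might arrive from data entry.
raw = [
    {"id": "1", "dob": "05/01/1980", "bp": "120/80", "note": "ok  "},
    {"id": "1", "dob": "05/01/1980", "bp": "120/80", "note": "ok  "},  # exact duplicate
    {"id": "2", "dob": "17/11/1975", "bp": "NA", "note": "  recheck"},
]

# Assumed spellings that should be coded as missing.
MISSING_CODES = {"", "NA", "N/A"}

def clean(rows):
    """Return cleaned copies of `rows`; the input is left unchanged."""
    seen, cleaned = set(), []
    for row in rows:
        # Remove exact duplicate records.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)

        # Convert the date from DD/MM/YYYY to ISO 8601 (YYYY-MM-DD).
        dob = datetime.strptime(row["dob"], "%d/%m/%Y").date().isoformat()

        # Split "systolic/diastolic" into two fields, coding missing values as None.
        if row["bp"].strip() in MISSING_CODES:
            sbp = dbp = None
        else:
            sbp, dbp = (int(part) for part in row["bp"].split("/"))

        cleaned.append({
            "id": row["id"],
            "dob": dob,
            "sbp": sbp,
            "dbp": dbp,
            "note": row["note"].strip(),  # remove unwanted whitespace
        })
    return cleaned

cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Extreme values are deliberately not altered here; the definition above speaks of investigating them, which usually means flagging them for review rather than changing them automatically.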

Data screening

Data screening consists of reviewing and documenting the properties and quality of the data that may affect future analysis and interpretation.

Source: Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.

Structural variables

Structural variables are variables that help to structure initial data analysis results for a clear organization and essential overview of data properties. Structuring can be based on levels of measurement (centers), on calendar time of recruitment, on demographic variables such as sex or age, or on variables of central importance to the research questions. Structural variables may or may not be included as predictors in the statistical models.

Source: [Heinze et al https://doi.org/10.1186/s12874-024-02294-3]
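
One way to picture the role of a structural variable is to repeat the same IDA summary within each of its levels. The sketch below does this for a hypothetical `center` and `sex` variable and a hypothetical `bmi` measurement; the data and the `summarize_by` helper are assumptions for illustration, not part of the cited framework.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records; None marks a missing value.
records = [
    {"center": "A", "sex": "f", "bmi": 22.1},
    {"center": "A", "sex": "m", "bmi": None},
    {"center": "B", "sex": "f", "bmi": 27.4},
    {"center": "B", "sex": "m", "bmi": 30.2},
    {"center": "B", "sex": "f", "bmi": None},
]

def summarize_by(rows, structural_var, target):
    """Summarize `target` separately for each level of `structural_var`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[structural_var]].append(row[target])

    summary = {}
    for level, values in sorted(groups.items()):
        observed = [v for v in values if v is not None]
        summary[level] = {
            "n": len(values),
            "n_missing": len(values) - len(observed),
            "median": median(observed) if observed else None,
        }
    return summary

by_center = summarize_by(records, "center", "bmi")
by_sex = summarize_by(records, "sex", "bmi")
print(by_center)
print(by_sex)
```

The same records are organized twice, once per structural variable, without either variable having to appear as a predictor in a later model.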