Glossary

Data analysis

Main data analysis (MDA)

Statistical approaches and methods to address research objectives.

Initial Data Analysis (IDA)

IDA is a systematic process that provides reliable knowledge about the data in order to determine their suitability for the main data analysis. IDA is aligned with the research aims and the main data analysis; it does not include hypothesis-generating activities or the assessment of associations between predictors and outcomes.

IDA has the following phases: (1) metadata setup; (2) data cleaning; (3) data screening; (4) initial data reporting; (5) refining and updating the research analysis plan; and (6) documenting and reporting IDA. 

Source: Schmidt CO, Vach W, le Cessie S, Huebner M. STRATOS: Introducing the Initial Data Analysis Topic Group (TG3). Biometric Bulletin 2018; 35 (2): 10-11

Exploratory Data Analysis

Exploratory Data Analysis (EDA) inspects data to uncover patterns or to identify possible errors and anomalies, for example through graphical approaches. EDA may include investigations for model building and hypothesis generation in a cyclical manner.

Source: Lusa L, Proust-Lima C, Schmidt CO, Lee KJ, le Cessie S, Baillie M, Lawrence F, Huebner M; TG3 of the STRATOS Initiative. Initial data analysis for longitudinal studies to build a solid foundation for reproducible analysis. PLoS One. 2024 May 29;19(5):e0295726. doi: 10.1371/journal.pone.0295726.

References:

Behrens JT. Principles and procedures of exploratory data analysis. Psychological methods. 1997;2(2):131. https://core.ac.uk/download/pdf/193648223.pdf

Cook D, Swayne DF. Interactive and Dynamic Graphics for Data Analysis: With R and Ggobi. New York, NY: Springer New York; 2007. https://en.wikipedia.org/wiki/Exploratory_data_analysis

Statistical Analysis Plan (SAP)

A statistical analysis plan is a document that contains a more technical and detailed elaboration of the principal features of the analysis described in the protocol and includes detailed procedures for executing the statistical analysis of the primary and secondary variables and other data.

Source: International Council for Harmonisation (ICH E9). ICH: E 9: Statistical principles for clinical trials - Step 5. Glossary p. 37

Initial Data Analysis Plan (IDAP)

An IDA plan focuses on data screening in preparation for the statistical analyses; it does not specify the software packages used to conduct the analyses. It can be understood as a minimum basic set of analyses that can be extended depending on the context of the research study and the data collection. IDA domains include unit missingness, item missingness, and univariable and multivariable descriptions.

Source: [Heinze et al https://doi.org/10.1186/s12874-024-02294-3; Lusa et al https://doi.org/10.1371/journal.pone.0295726]
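
As an illustration of the IDA domains named above, the following minimal sketch computes unit missingness, item missingness, and a univariable description using only the Python standard library. The participant IDs, the variables `age` and `sbp`, and all function names are hypothetical and chosen for this example; they are not taken from the cited sources, and a real IDA plan would cover more variables and also multivariable descriptions.

```python
from statistics import mean, median

# Hypothetical data: six invited participants; None marks a missing item.
# Participants 4 and 6 have no record at all (unit missingness).
invited_ids = [1, 2, 3, 4, 5, 6]
records = {
    1: {"age": 54, "sbp": 132.0},
    2: {"age": 61, "sbp": None},
    3: {"age": None, "sbp": 118.0},
    5: {"age": 47, "sbp": 141.0},
}

def unit_missingness(invited, observed):
    """Proportion of invited units with no record at all."""
    absent = [i for i in invited if i not in observed]
    return len(absent) / len(invited)

def item_missingness(observed, variable):
    """Proportion of observed units with a missing value for one variable."""
    values = [row.get(variable) for row in observed.values()]
    return sum(v is None for v in values) / len(values)

def describe(observed, variable):
    """Univariable description based on the non-missing values."""
    values = sorted(row[variable] for row in observed.values()
                    if row.get(variable) is not None)
    return {"n": len(values), "min": values[0], "median": median(values),
            "mean": mean(values), "max": values[-1]}

unit_rate = unit_missingness(invited_ids, records)
item_rates = {v: item_missingness(records, v) for v in ("age", "sbp")}
age_summary = describe(records, "age")
print(unit_rate, item_rates, age_summary)
```

Keeping the three quantities in separate functions mirrors the separation of the domains: unit missingness is measured against the invited sample, whereas item missingness and the univariable description are measured against the observed records.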

Elements of initial data analysis

Data preprocessing

Data preprocessing, or data preparation, refers to the processes involved in preparing data for an analysis project. It may involve acquiring, integrating, enriching, transforming, standardizing, checking, cleaning, formatting or structuring, or sampling data. Two phases may be distinguished: processes carried out before the analyst receives the data, and activities carried out by the analyst to prepare an analysis-ready dataset.

Data wrangling

Data wrangling is a broad term referring to the processes involved in preparing data for analysis. It can include acquiring data, enriching, changing the format and shape of the data, combining, subsetting and sampling data, and cleaning data. Some common steps in data wrangling are:

Discovering and gathering the data needed
Merging data from different sources, if necessary
Fixing flaws in the data entries
Extracting the necessary data and putting it in the proper structure
Storing it in the proper format for further use

Source: National Library of Medicine
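
The wrangling steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two CSV sources, the column names (`id`, `sex`, `age`, `hdl`), and the helper functions are hypothetical, and the merge is a simple left join on `id`, which is one of several possible merge strategies.

```python
import csv
import io

# Hypothetical raw inputs standing in for two data sources.
baseline_csv = "id,sex,age\n1,F,54\n2,m,61\n3,F,\n"
lab_csv = "id,hdl\n1,1.4\n3,1.1\n4,0.9\n"

def read_rows(text):
    """Gather: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def merge_by_id(left, right):
    """Merge: left join on 'id'; right-only rows (here id 4) are dropped."""
    lookup = {row["id"]: row for row in right}
    return [{**row, **lookup.get(row["id"], {})} for row in left]

def fix_entries(rows):
    """Fix flaws: harmonise the case of 'sex' and turn '' into None."""
    fixed = []
    for row in rows:
        clean = {k: (v.strip() or None) for k, v in row.items()}
        if clean.get("sex"):
            clean["sex"] = clean["sex"].upper()
        fixed.append(clean)
    return fixed

def select_columns(rows, columns):
    """Extract and structure: keep chosen columns in a fixed order."""
    return [{c: row.get(c) for c in columns} for row in rows]

def to_csv(rows, columns):
    """Store: serialise to CSV text (None is written as an empty field)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

columns = ["id", "sex", "age", "hdl"]
merged = merge_by_id(read_rows(baseline_csv), read_rows(lab_csv))
tidy = select_columns(fix_entries(merged), columns)
print(to_csv(tidy, columns))
```

Each function corresponds to one step of the list, so the order of the steps stays visible in the last three lines and individual steps can be replaced without touching the others.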

Data cleaning

Data cleaning is the process of identifying and correcting data that are inaccurate, missing, or incomplete. Data cleaning tasks can include removing duplicate records, investigating extreme values (e.g., outliers), converting dates from one format to another, removing unwanted text, splitting multiple data points in a cell into separate cells, or coding missing or NA values.

Source: National Library of Medicine
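
A few of the cleaning tasks mentioned above (removing duplicate records, coding missing values, converting dates, and investigating extreme values) might look as follows in a minimal standard-library sketch. The records, the missing-value codes `NA` and `-99`, the day/month/year input format, and the plausibility limits for weight are all assumptions made for this example.

```python
from datetime import datetime

# Hypothetical raw records; "NA" and "-99" are assumed missing-value codes.
raw = [
    {"id": "1", "visit": "03/02/2021", "weight_kg": "72.5"},
    {"id": "1", "visit": "03/02/2021", "weight_kg": "72.5"},  # duplicate
    {"id": "2", "visit": "15/02/2021", "weight_kg": "NA"},
    {"id": "3", "visit": "20/02/2021", "weight_kg": "725"},   # implausible
    {"id": "4", "visit": "21/02/2021", "weight_kg": "-99"},
]
MISSING_CODES = {"", "NA", "-99"}
PLAUSIBLE_WEIGHT = (30.0, 300.0)  # assumed plausibility limits in kg

def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def clean_row(row):
    """Recode missing values, convert types, and flag extreme values."""
    text = row["weight_kg"].strip()
    weight = None if text in MISSING_CODES else float(text)
    low, high = PLAUSIBLE_WEIGHT
    return {
        "id": row["id"],
        # Convert day/month/year to ISO 8601 (year-month-day).
        "visit": datetime.strptime(row["visit"], "%d/%m/%Y").date().isoformat(),
        "weight_kg": weight,
        # Flag for review instead of silently deleting or altering the value.
        "weight_flag": weight is not None and not (low <= weight <= high),
    }

cleaned = [clean_row(r) for r in drop_duplicates(raw)]
for row in cleaned:
    print(row)
```

The extreme value is flagged rather than corrected, which keeps the investigation step separate from any later decision about how to handle the record.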

Data screening

Data screening consists of reviewing and documenting the properties and quality of the data that may affect future analysis and interpretation.

Source: Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.

Structural variables

Structural variables are variables that help to structure initial data analysis results for a clear organization and essential overview of data properties. Structuring can be based on levels of measurement (centers), on calendar time of recruitment, on demographic variables such as sex or age, or on variables of central importance to the research questions. Structural variables may or may not be included as predictors in the statistical models.

Source: [Heinze et al https://doi.org/10.1186/s12874-024-02294-3]
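
To illustrate how a structural variable can organize IDA results, the sketch below reports the number of records, the number of missing values, and the median of one variable separately for each level of a structural variable. The variable names `center` and `age`, the data, and the function `summarise_by` are hypothetical; the choice of summary statistics is an assumption for this example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records; 'center' plays the role of the structural variable.
records = [
    {"center": "A", "age": 54}, {"center": "A", "age": None},
    {"center": "A", "age": 61}, {"center": "B", "age": 47},
    {"center": "B", "age": 39}, {"center": "B", "age": 58},
]

def summarise_by(rows, structural, variable):
    """Report n, missing count and median of `variable` per stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[structural]].append(row[variable])
    summary = {}
    for level, values in sorted(strata.items()):
        observed = [v for v in values if v is not None]
        summary[level] = {
            "n": len(values),
            "missing": len(values) - len(observed),
            "median": median(observed) if observed else None,
        }
    return summary

by_center = summarise_by(records, "center", "age")
for level, stats in by_center.items():
    print(level, stats)
```

The same function can be called with a different structural variable, for example sex or recruitment period, to obtain the corresponding stratified overview.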
