3  Analysis ready data for IDA

Based on the IDA plan, this section prepares the source data to be analysis ready: read, clean, tidy and transform. This section focuses on the steps prior to IDA (data screening) and the required additions to the source data in order to prepare the data. The aim is to produce an analysis ready data set for the research objective.

3.1 Analysis ready dataset

The aim of this section and the remaining chapters of the report is to document the steps taken to transform the source data set into an analysis ready data set. These steps precede execution of the IDA plan.

The steps taken in this section are guided by the data set specification for the analysis ready data set, which is based on review of the IDA and analysis strategy.

For example, additional meta-data, data derivations and indicator flags are added to the source data set.

To support IDA, it is important that we keep track of the changes to the source data including all new modifications, data derivations and transformations. Therefore, we store references to the source data in the data folder after adding additional meta-data for all variables.

The format of the analysis ready data set is modeled on the CDISC Analysis Data Model (ADaM).

3.1.1 Data set transformations

Important meta-data from the data dictionary is added to the data set. At this stage we could also select the variables of interest to take into the IDA phase by dropping the variables that are not checked in IDA.

First the source data set and corresponding data dictionary are loaded. The variable names are normalized.

A new variable is created called PARAMCD which stores the abbreviated variable name as a reference. This provides a link between the dictionary and source data.

The lab-specific measurements in the source data are then transformed into a long format and stored in a data set named ADLB. A long format enables efficient data processing and also allows new transformed variables to be added during the course of IDA. Structural and demographic variables are stored in a wide format in a separate data set named ADSL.
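The reshaping described above can be sketched as follows. This is a minimal illustration in plain Python, not the report's own code: the sample rows, the split of columns into structural and lab variables, and the value column name AVAL are assumptions made for the example.

```python
# Hypothetical source rows in wide format: one row per patient.
source = [
    {"ID": 1, "AGE": 62, "SEX": 1, "WBC": 7.1, "CRP": 12.0, "BloodCulture": "no"},
    {"ID": 2, "AGE": 45, "SEX": 2, "WBC": 13.4, "CRP": None, "BloodCulture": "yes"},
]

# Assumed column split; the real split comes from the data set specification.
STRUCTURAL = ["ID", "AGE", "SEX", "BloodCulture"]
LAB_PARAMS = ["WBC", "CRP"]


def split_wide_to_long(rows, structural, lab_params):
    """Return (adsl, adlb): wide subject-level rows and long lab rows."""
    # ADSL keeps one row per patient with the structural variables only.
    adsl = [{col: row[col] for col in structural} for row in rows]
    # ADLB holds one row per patient and lab parameter; PARAMCD links each
    # row back to the data dictionary, AVAL (assumed name) holds the value.
    adlb = [
        {"ID": row["ID"], "PARAMCD": param, "AVAL": row[param]}
        for row in rows
        for param in lab_params
    ]
    return adsl, adlb


adsl, adlb = split_wide_to_long(source, STRUCTURAL, LAB_PARAMS)
```

Because each lab measurement is its own row, a transformed variable derived later in IDA can be appended as additional rows with a new PARAMCD instead of a new column.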

3.1.2 Add lab variable meta-data

Lab parameter information, such as labels and units, is added to the transformed data directly from the data dictionary.

At this point, additional variables and metadata are derived per the analysis plan including:

  • Units
  • Variable type
  • Categories for sex
  • A more informative name for the outcome variable
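One way to picture these additions is a lookup against the data dictionary followed by a recode of the subject-level rows. The sketch below is illustrative only: the dictionary entries, the unit strings, the sex coding, and the names PARAM, UNIT, TYPE, SEXC and BACTEREMIA are hypothetical and are not taken from the report.

```python
# Hypothetical data dictionary keyed by PARAMCD (entries are illustrative).
dictionary = {
    "WBC": {"label": "White blood count", "unit": "G/L", "type": "continuous"},
    "CRP": {"label": "C-reactive protein", "unit": "mg/dL", "type": "continuous"},
}

# Hypothetical coding of sex in the source data.
SEX_LABELS = {1: "male", 2: "female"}


def add_param_metadata(adlb_rows, data_dictionary):
    """Attach label, unit and type from the dictionary to each long-format row."""
    enriched = []
    for row in adlb_rows:
        meta = data_dictionary.get(row["PARAMCD"], {})
        enriched.append({
            **row,
            "PARAM": meta.get("label"),
            "UNIT": meta.get("unit"),
            "TYPE": meta.get("type"),
        })
    return enriched


def recode_subject(row, sex_labels, outcome_old, outcome_new):
    """Rename the outcome column and add a character version of sex."""
    out = {key: value for key, value in row.items() if key != outcome_old}
    out[outcome_new] = row[outcome_old]
    out["SEXC"] = sex_labels.get(row["SEX"])
    return out


adlb = add_param_metadata([{"ID": 1, "PARAMCD": "WBC", "AVAL": 7.1}], dictionary)
adsl = [recode_subject({"ID": 1, "SEX": 2, "BloodCulture": "no"},
                       SEX_LABELS, "BloodCulture", "BACTEREMIA")]
```

Parameters missing from the dictionary receive empty metadata rather than raising an error, which makes gaps in the dictionary visible during later checks.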

3.1.3 Reorder variables

Select and re-order the variables as per the data set specification.

3.1.4 Add informative variable meta-data

Add variable metadata as label attributes.

3.1.5 Visual check outcome is correct

Visually check that no errors have been introduced in the outcome variable. First, display the marginal distribution of the variable in the source data.

BloodCulture       n
no            675550
yes            59000

Second, display the marginal distribution of the variable in the transformed data.

Blood culture result for bacteremia (Character coding)       n
no                                                      675550
yes                                                      59000
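The visual comparison of the two tables can be complemented by a programmatic check that the marginal counts are identical. The following is a small sketch in plain Python with made-up rows; the transformed column name BACTEREMIAC is a hypothetical stand-in for the renamed outcome.

```python
from collections import Counter


def marginal_counts(rows, column):
    """Count how often each value of `column` occurs in `rows`."""
    return Counter(row[column] for row in rows)


# Hypothetical miniature versions of the source and transformed data.
source_rows = [
    {"BloodCulture": "no"},
    {"BloodCulture": "yes"},
    {"BloodCulture": "no"},
]
transformed_rows = [
    {"BACTEREMIAC": "no"},
    {"BACTEREMIAC": "yes"},
    {"BACTEREMIAC": "no"},
]

source_counts = marginal_counts(source_rows, "BloodCulture")
transformed_counts = marginal_counts(transformed_rows, "BACTEREMIAC")

# The transformation should only rename and relabel the outcome,
# so the counts per category must match exactly.
assert source_counts == transformed_counts, "outcome distribution changed"
```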

3.1.6 Derive indicator flags

The next step is to derive indicator flags for predictors as per the IDA plan (see Section 2.2):

  • Key predictors, which will be included in the model, are age (AGE), leukocytes (WBC), blood urea nitrogen (BUN), creatinine (CREA), thrombocytes (PLT), and neutrophils (NEU).

  • Predictors of medium importance are potassium (POTASS), and some acute-phase related parameters such as fibrinogen (FIB), C-reactive protein (CRP), aspartate transaminase (ASAT), alanine transaminase (ALAT), and gamma-glutamyl transpeptidase (GGT).
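The two predictor groups listed above could be encoded as flags on the parameter-level rows, for example as in the sketch below. The flag names KEY_PRED_FL and MED_PRED_FL, the "Y"/"N" coding, and the extra parameter MCV in the sample are assumptions for illustration; only the predictor codes themselves come from the lists above.

```python
# Predictor codes as listed in the IDA plan summary above.
KEY_PREDICTORS = {"AGE", "WBC", "BUN", "CREA", "PLT", "NEU"}
MEDIUM_PREDICTORS = {"POTASS", "FIB", "CRP", "ASAT", "ALAT", "GGT"}


def add_predictor_flags(rows):
    """Add Y/N flags (hypothetical names) marking key and medium predictors."""
    flagged = []
    for row in rows:
        code = row["PARAMCD"]
        flagged.append({
            **row,
            "KEY_PRED_FL": "Y" if code in KEY_PREDICTORS else "N",
            "MED_PRED_FL": "Y" if code in MEDIUM_PREDICTORS else "N",
        })
    return flagged


# MCV stands in for a parameter that belongs to neither group.
params = [{"PARAMCD": "WBC"}, {"PARAMCD": "CRP"}, {"PARAMCD": "MCV"}]
flagged = add_predictor_flags(params)
```

Storing the grouping as flags lets later IDA summaries be filtered by predictor importance without repeating the variable lists.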

Next, add metadata flags to indicate the relationships between blood cell parameters (see Section 2.2).

3.1.7 Data derivations

Now, derive age groups. For the purpose of stratifying IDA results by age, age will be categorized into the following three groups (Section 2.2):

  • [16, 50],
  • (50, 65],
  • (65, 101].
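The interval notation above means that the first group includes both 16 and 50, while the other two groups exclude their lower bound and include their upper bound. A minimal sketch of this derivation in plain Python follows; the function name and the choice to return None for missing or out-of-range ages are assumptions.

```python
def age_group(age):
    """Map an age in years to one of the three IDA strata, or None."""
    if age is None:
        return None  # missing age stays missing
    if 16 <= age <= 50:
        return "[16, 50]"
    if 50 < age <= 65:
        return "(50, 65]"
    if 65 < age <= 101:
        return "(65, 101]"
    return None  # outside the range covered by the three groups


# Boundary values fall into the lower of the two adjacent groups.
groups = [age_group(a) for a in (16, 50, 51, 65, 66, 101)]
```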

3.1.8 Save analysis ready data for IDA

Save the analysis data as two linked data sets, following a structure similar to the CDISC ADaM data standard. Individual patient measurements are stored in a data set called ADSL. The lab-specific measurements are stored in ADLB (a long format data set).

Saving ADSL into an intermediate location DATA/IDA/ADSL_01.rds prior to IDA.

Saving ADLB into an intermediate location DATA/IDA/ADLB_01.rds prior to IDA.

Note: At this stage of IDA, both ADSL and ADLB are intermediate files that will be used for further IDA. Findings in IDA may require updates to either data set.