Stage1 data cleaning #80

ZoeMZou · 2025-01-16T11:11:30Z

PR: Refinements for Script Structure & Data Processing

(Based on discussions on 28th January 2025)

🔹 1. Structural Changes
📌 1.1 Naming Conventions

Removed _stage1 from scripts to improve clarity and consistency.
The old stage1_data_cleaning.R script was handling:
1. Setting reference levels for categorical variables
2. Applying quality assurance
3. Applying inclusion/exclusion criteria
The new structure separates these tasks into three dedicated functions:
- fn-ref.R → Sets reference levels
- fn-qa.R → Applies quality assurance
- fn-inex.R → Applies inclusion/exclusion criteria
The main script data_cleaning.R calls these functions, so project-specific edits are made in the function scripts while data_cleaning.R remains clear and workflow-focused.
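
The structure described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the functions `ref()`, `qa()` and `inex()`, the toy criteria, and the columns `qa_num_birth_year` and the toy data are assumptions standing in for whatever fn-ref.R, fn-qa.R and fn-inex.R actually define.

```r
# Toy input; column names and values are illustrative only.
input <- data.frame(
  patient_id = 1:4,
  cov_cat_sex = c("male", "female", "female", "male"),
  qa_num_birth_year = c(1975, 1780, 1980, 1990),
  inex_bin_alive = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Flow table recording how many rows remain after each step.
consort <- data.frame(Description = "Input", N = nrow(input),
                      stringsAsFactors = FALSE)

# Hypothetical stand-in for fn-ref.R: set the reference level.
ref <- function(input) {
  input$cov_cat_sex <- relevel(factor(input$cov_cat_sex), ref = "female")
  input
}

# Hypothetical stand-in for fn-qa.R: one toy quality assurance rule.
qa <- function(input, consort) {
  keep <- !is.na(input$qa_num_birth_year) & input$qa_num_birth_year >= 1793
  input <- input[keep, ]
  consort[nrow(consort) + 1, ] <- list("Quality assurance: Year of birth", nrow(input))
  list(input = input, consort = consort)
}

# Hypothetical stand-in for fn-inex.R: one toy inclusion criterion.
inex <- function(input, consort) {
  input <- input[input$inex_bin_alive %in% TRUE, ]
  consort[nrow(consort) + 1, ] <- list("Inclusion criteria: Alive at index", nrow(input))
  list(input = input, consort = consort)
}

# The main script only orchestrates the steps.
input <- ref(input)
res <- qa(input, consort)
res <- inex(res$input, res$consort)
input <- res$input
consort <- res$consort
print(consort)
```

Keeping each rule inside its own function means a project can change a criterion without touching the orchestration at the bottom.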

📌 1.2 Output Dataset Naming

The cleaned dataset is now saved as input_{cohort}.rds.
The preprocessed dataset has been renamed from input_{cohort}.rds to input_{cohort}_0.rds (to avoid YAML conflicts).

🔹 2. fn-ref.R — Reference Level Settings

📌 2.1 Data Formatting Fixes

Some categorical variables were not factors, and some numeric variables were not numeric. I added code back to preprocess_data.R to ensure correct formatting:
🔗 See fix in preprocess_data.R
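
As an illustration of this kind of formatting fix (not the code linked above), columns can be coerced by name pattern. The sketch assumes that categorical variables contain `_cat_` and numeric variables contain `_num_` in their names; `cov_num_age` is a made-up column.

```r
# Toy data in which every column has been read in as character.
df <- data.frame(
  cov_cat_sex = c("female", "male", "female"),
  cov_num_age = c("34", "51", "67"),
  stringsAsFactors = FALSE
)

# Select columns by the assumed naming convention.
cat_cols <- grep("_cat_", colnames(df), value = TRUE)
num_cols <- grep("_num_", colnames(df), value = TRUE)

# Categorical columns become factors.
df[cat_cols] <- lapply(df[cat_cols], factor)

# Going through as.character() first avoids returning factor codes
# if a numeric column happens to have been read in as a factor.
df[num_cols] <- lapply(df[num_cols], function(x) as.numeric(as.character(x)))

str(df)
```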

📌 2.2 Code Deletions from Old Repo
Index Date: Already correct—no correction needed.
🔗 Old repo reference
Deprivation (IMD): No need for re-categorization; new repo includes IMD (1–5) directly.
🔗 Old repo reference

📌 2.3 Sex Category Adjustments
Old repo: Only male (M) and female (F).
New repo: Four levels → female, male, intersex, unknown.
Fix:
Set non-male/non-female values to missing. Retain three levels: female, male, unknown.

🔹 3. fn-qa.R — Quality Assurance Fixes

📌 3.1 Fixing "Date of Death" Message
The original message incorrectly stated that NA values were being removed.
Fix: Message now correctly states:
"Quality assurance: Date of death is invalid (on or before 1/1/1900 or after current date)"

🔗 Old repo reference

📌 3.2 Fixing Missing Data Handling
Issue: Patients with missing sex were marked as missing for the entire record, which was unintended.
Fix: The new script preserves records and only marks sex as missing.
🔗 Updated QA code

📌 3.3 Update to qa_bin_prostate_cancer
Issue:
The previous method introduced missing values because logical operations involving NA can return NA. As a result, alive patients with no recorded prostate cancer diagnosis were assigned NA rather than FALSE.
Fix: The check for cause of death due to prostate cancer was unnecessary and has been removed. Patients who died from prostate cancer before the index date are already excluded later in the study. Removing this check ensures all alive participants receive a valid classification (TRUE/FALSE) rather than NA.
🔗 Updated method
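
The NA behaviour behind this issue can be shown with a small R illustration of three-valued logic. The vectors below are invented for the example and are not the variables used in the actual definition.

```r
# Three alive patients; names and values are illustrative only.
has_prostate_code <- c(TRUE, FALSE, FALSE)
died_of_prostate_cancer <- c(NA, NA, FALSE)  # NA: no death record to check

# OR-ing with NA stays NA whenever the known operand is FALSE.
old_flag <- has_prostate_code | died_of_prostate_cancer
print(old_flag)

# Dropping the death check leaves a flag that is always TRUE or FALSE.
new_flag <- has_prostate_code
print(new_flag)
```

The second patient is the problematic case: no diagnosis and no death record, yet the combined flag is NA rather than FALSE.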

🔹 4. fn-inex.R — Inclusion/Exclusion Criteria Fixes

📌 4.1 Code Corrections
The new repo includes inex_bin_alive, so we can use it directly instead of recalculating.
🔗 Old repo reference
🔗 New implementation

📌 4.2 Variable Renaming
death_date → cens_date_death
has_follow_up_previous_6months → inex_bin_6m_reg
deregistration_date → cens_date_dereg

📌 4.3 Removal of Unnecessary Code
Issue: The old repo had an unnecessary step for active registration at index.
Fix:
Removed redundant filtering for active registration, as inex_bin_6m_reg already ensures this. I rewrote the print message to reflect this.
🔗 Old repo reference
🔗 Updated function

📌 4.4 Fix for Exclusion Criteria in the Vax Cohort
Issue: Errors in the vax cohort exclusion criteria due to incorrect variable types.
Fix:
Redefined data types to ensure they are numeric before calculating vax_mixed.
🔗 Updated fix
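
A hedged sketch of why the type matters: arithmetic on character or factor columns either fails or silently uses factor codes. The columns `vax_a` and `vax_b` and the rule for `vax_mixed` below are assumptions for illustration, not the definition used in the linked fix.

```r
# Toy product indicators stored with the wrong types.
input <- data.frame(
  vax_a = c("1", "0", "1"),
  vax_b = factor(c("0", "0", "1")),
  stringsAsFactors = FALSE
)

# Convert via character so that factor labels, not codes, are used.
to_numeric <- function(x) as.numeric(as.character(x))
input$vax_a <- to_numeric(input$vax_a)
input$vax_b <- to_numeric(input$vax_b)

# Hypothetical rule: more than one product recorded counts as mixed.
input$vax_mixed <- (input$vax_a + input$vax_b) > 1
print(input)
```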

🔹 5. Other Updates

📌 5.1 Cause of Death Extraction Simplified
A more efficient method for extracting cause of death has become available in OpenSAFELY.
Fix: Updated our code to align with the latest OpenSAFELY documentation.
🔗 Updated function

Further consider secondary care diagnosis

1. Remove opa diagnosis for hospital admission. 2. Revise covid-19 severity variable by focusing on primary diagnosis only.

To include diagnoses in any position in the function. Previously we included only the primary diagnosis and the first code of the secondary diagnosis. The definition of Covid-19 hospitalisation, `sub_date_covid19_hospital`, does not use any helper function because its definition is unique and not worth wrapping in a function of its own.

Dataset definition revision

…post-covid-respiratory into Stage1_data_cleaning

update the code for extracting death data

Exclude death from definition for qa_bin_prostate_cancer, as people who die from it before index date will be excluded from the study anyway

ZoeMZou · 2025-01-30T18:48:00Z

Hi @venexia,

This PR is now ready for your review. It runs successfully locally. Since the revisions in this script are quite detailed, I’ve listed the key changes at the beginning of the PR to make it easier to navigate and review.

Thank you very much.

Best wishes,
Zoe

venexia

Hi @ZoeMZou. Great work - well done! Just a few things to address:

- Some of the QA criteria are out of date - these should be updated to match the current protocol
- Some of the code for QA and inex could be simplified by using dplyr::case_when and/or removing double negatives
- The naming structure for the files through the pipeline is not very intuitive - we should agree an approach for this when we next meet the post-covid-events team
- The currently generated dummy data does not have the variation to test this code properly - we should resolve this so we can double check we are removing the correct people (I did this for the first QA criterion but haven't gone through each criterion in this review)
- The formatting of the scripts is inconsistent and should be reviewed - we can automate this using linting but I need to look up how to set it up

venexia · 2025-02-04T16:38:35Z

analysis/data_cleaning/fn-qa.R

+  consort[nrow(consort)+1,] <- c("Quality assurance: Year of birth is before 1793 or year of birth exceeds current date",
+                                 nrow(input))
+
+  print('Quality assurance: Date of death is invalid (on or before 1/1/1900 or after current date)')


I would make this something like:

print('Quality assurance: Date of death is invalid (on or before earliest expected date or after current date)')

It is only on or before 1/1/1900 when study_dates$earliest_expec=1/1/1900.

Though on further reading of the protocol, I notice this criterion is listed as 'Remove individuals whose date of death is after today' and does not mention invalid death dates.

Decision from today's meeting (February 11, 2025): The repository will be revised to ensure consistency with the protocol regarding inclusion criteria and quality assurance.
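
One way to keep the message in step with the configuration is to build it from `study_dates$earliest_expec` instead of hard-coding the date. The sketch below assumes `study_dates` is a list holding the date as a string; the toy death dates and the fixed value of `today` are made up for reproducibility. It follows the wording suggested in the review, which may differ from what the protocol finally requires.

```r
# Assumed structure; in the project this would be loaded from the study dates file.
study_dates <- list(earliest_expec = "1900-01-01")

msg <- paste0(
  "Quality assurance: Date of death is invalid (on or before ",
  study_dates$earliest_expec,
  " or after current date)"
)
print(msg)

earliest <- as.Date(study_dates$earliest_expec)
today <- as.Date("2025-02-11")  # fixed here; Sys.Date() would be used in practice

# NA means no recorded death, which must not be treated as invalid.
death_date <- as.Date(c(NA, "1899-12-31", "2020-05-01", "2030-01-01"))
invalid <- !is.na(death_date) & (death_date <= earliest | death_date > today)
print(invalid)
```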

venexia · 2025-02-04T16:42:26Z

analysis/data_cleaning/fn-qa.R

+  consort[nrow(consort)+1,] <- c("Quality assurance: Date of death is invalid (on or before 1/1/1900 or after current date)",
+                                 nrow(input))
+
+  print('Quality assurance: Pregnancy/birth codes for men')


I am in two minds about including | is.na(input$cov_cat_sex) - the people with missing sex might include, for example, pregnancy/birth codes for men. This applies to all sex specific QA criteria.

Yes, individuals with missing sex should be included here. Only records with sex-inconsistent codes are excluded: pregnancy/birth codes or HRT/COCP medications for men, and prostate cancer codes for women. We will handle the exclusion of individuals with missing sex in fn-inex.R to ensure accurate counts. Excluding them here would lead to inconsistencies in the reported number of missing sex cases.
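
A minimal sketch of a sex-specific check that keeps people with missing sex, assuming a hypothetical flag `qa_bin_pregnancy` alongside `cov_cat_sex`:

```r
# Toy data; qa_bin_pregnancy is an illustrative column name.
input <- data.frame(
  cov_cat_sex = c("male", "female", NA, "male"),
  qa_bin_pregnancy = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# %in% treats NA as "no match", so only rows known to be male with a
# pregnancy/birth code are flagged; rows with missing sex are kept.
remove <- input$cov_cat_sex %in% "male" & input$qa_bin_pregnancy %in% TRUE
input <- input[!remove, ]
print(input)
```

Using `%in%` rather than `==` avoids NA in the row index, which would otherwise produce all-NA rows when subsetting.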

analysis/dataset_definition/variables_cohorts.py

analysis/data_cleaning/data_cleaning.R

venexia · 2025-02-04T16:55:57Z

analysis/data_cleaning/fn-inex.R

+    consort[nrow(consort)+1,] <- c("Inclusion criteria: Did not recieve a second dose vaccination before their first dose vaccination",
+                                    nrow(input))
+
+    print('Inclusion criteria: Did not recieve a mixed vaccine products before 07-05-2021')


I think we could make this more readable using dplyr::case_when().

We will use this PR to merge preprocess and data_cleaning for better efficiency. Key steps to follow:

Apply Inclusion Criteria First: Ensure inclusion criteria are applied before quality assurance checks.

Improve Readability: Use case_when for filtering datasets to enhance clarity.

Step-by-Step Validation: Check the dataset at each stage to confirm that the code is functioning as expected.
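
As a rough sketch of the case_when approach, each person can be given a single exclusion reason, which can be tabulated before filtering. The data and the reason labels are invented for illustration; only `inex_bin_alive` and `inex_bin_6m_reg` come from this PR.

```r
library(dplyr)

# Toy data; values are illustrative only.
input <- tibble(
  patient_id = 1:4,
  inex_bin_alive = c(TRUE, FALSE, TRUE, NA),
  inex_bin_6m_reg = c(TRUE, TRUE, FALSE, TRUE)
)

# The first matching condition wins, so the order mirrors the order in
# which the criteria are applied. Missing flags count as not meeting
# the criterion.
input <- input %>%
  mutate(
    exclusion = case_when(
      !(inex_bin_alive %in% TRUE) ~ "Not alive at index",
      !(inex_bin_6m_reg %in% TRUE) ~ "Not registered for 6 months",
      TRUE ~ NA_character_
    )
  )

# Check the counts at this stage before dropping anyone.
print(table(input$exclusion, useNA = "ifany"))

cleaned <- filter(input, is.na(exclusion))
```

Keeping the reason as a column makes the step-by-step validation easier, because the counts per criterion can be inspected before any rows are removed.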

venexia · 2025-02-05T16:29:06Z

analysis/data_cleaning/fn-qa.R

We should discuss whether the dummy data is appropriate to test these scripts when we next meet. I have everyone born in the year 1975 in my dataset so nobody matches the first QA criterion and it is hard to tell if it is working without modifying the dummy data (which we could do).

venexia · 2025-02-05T16:30:38Z

analysis/preprocess/preprocess_data.R

@@ -112,7 +114,7 @@ df1[,colnames(df)[grepl("tmp_",colnames(df))]] <- NULL

 # Save input -------------------------------------------------------------------

-saveRDS(df1, file = paste0("output/input_",cohort_name,".rds"), compress = TRUE)
+saveRDS(df1, file = paste0("output/input_",cohort_name,"_0.rds"), compress = TRUE)


Is this intended as a permanent change or was it for testing purposes? The life course of the file is now input_cohort.csv.gz > input_cohort_0.rds > input_cohort.rds, which is a little confusing as you expect the first and third file to be the same?

Decision from today's meeting (11th Feb 2025):

We will not create an intermediate dataset from preprocess.

After merging preprocess and data_cleaning, the only output will be input_cohort_clean.rds.

analysis/dataset_definition/codelists.py

ZoeMZou added 10 commits January 7, 2025 16:56

Codelist (ICD-10) update for respiratory outcome

070c04d

Update respiratory outcomes, preex conditions, and covariates

86aacf4

Further consider secondary care diagnosis

Update active_analyses.rds

598dfe1

Delete functions for opa and ec for secondary care diagnosis

b4a2cdf

Update variables_cohorts.py

ed2a93b

1. Remove opa diagnosis for hospital admission. 2. Revise covid-19 severity variable by focusing on primary diagnosis only.

Create stage_1_data_cleaning.R

0b4dc29

Update stage_1_data_cleaning.R

7d04891

Update preprocess_data.R

c1120fc

Update stage_1_data_cleaning.R

cdeb9bc

ZoeMZou linked an issue Jan 16, 2025 that may be closed by this pull request

Stage_1_data_cleaning issues #81

Open

Update variables_cohorts.py

69c9dc4

ZoeMZou mentioned this pull request Jan 16, 2025

Stage_1_data_cleaning issues #81

Open

ZoeMZou requested a review from venexia January 16, 2025 17:05

ZoeMZou and others added 16 commits January 21, 2025 14:43

Merge pull request #71 from opensafely/dataset_definition_revision

09877e9

Dataset definition revision

Create stage_1_data_cleaning.R

5879c81

Update stage_1_data_cleaning.R

ec1f44d

Update preprocess_data.R

82f860e

Update stage_1_data_cleaning.R

1e4dff5

Merge branch 'Stage1_data_cleaning' of https://github.com/opensafely/…

b13e556

…post-covid-respiratory into Stage1_data_cleaning

Update variables_cohorts.py

7c25d2f

Create fn-inex.R

23285dd

Create fn-qa.R

c69e1a3

Update functions for data cleaning

5f9811c

update data_cleaning

cb3784b

Update fn-qa.R

fed40ec

update data_cleaning using R functions

fe5f717

Update variable_helper_functions.py

2af7334

update the code for extracting death data

Update variables_cohorts.py

d4b3e1a

Exclude death from definition for qa_bin_prostate_cancer, as people who die from it before index date will be excluded from the study anyway

Update variable_helper_functions.py

9d97bc2

ZoeMZou added 4 commits January 28, 2025 18:36

Update fn-qa.R

d24722d

Update data_cleaning

3af70d5

Update data_cleaning.R

82403da

Update fn-inex.R

6adea6f

ZoeMZou requested review from venexia and removed request for venexia January 30, 2025 12:14

Update fn-inex.R

52da575

ZoeMZou marked this pull request as ready for review January 30, 2025 18:48

venexia requested changes Feb 7, 2025
