Stage1 data cleaning #80
I think we could make this more readable using dplyr::case_when().
We will use this PR to merge preprocess and data_cleaning for better efficiency. Key steps to follow:
Apply Inclusion Criteria First: Ensure inclusion criteria are applied before quality assurance checks.
Improve Readability: Use case_when for filtering datasets to enhance clarity.
Step-by-Step Validation: Check the dataset at each stage to confirm that the code is functioning as expected.
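The steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's code: the dummy data, the cov_num_age column, and the age thresholds are assumptions invented for the example; only cov_cat_sex appears in this thread.

```r
library(dplyr)

# Hypothetical dummy data; cov_num_age and its thresholds are assumptions.
input <- tibble(
  patient_id  = 1:5,
  cov_cat_sex = c("female", "male", NA, "male", "female"),
  cov_num_age = c(45, 17, 60, 112, 30)
)

# Step 1: label each row with the first inclusion criterion it fails.
# case_when() evaluates conditions in order and treats NA conditions as FALSE.
staged <- input %>%
  mutate(
    inex_status = case_when(
      is.na(cov_num_age) ~ "excluded: missing age",
      cov_num_age < 18   ~ "excluded: under 18",
      cov_num_age > 110  ~ "excluded: over 110",
      is.na(cov_cat_sex) ~ "excluded: missing sex",
      TRUE               ~ "included"
    )
  )

# Step 2: inspect the counts at this stage before moving on.
print(count(staged, inex_status))

# Step 3: only the included rows go forward to the quality assurance checks.
cohort <- filter(staged, inex_status == "included")
```

Labelling rows before filtering keeps a per-criterion count available at each stage, which makes it easier to confirm that each rule behaves as expected.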
We should discuss whether the dummy data is appropriate to test these scripts when we next meet. I have everyone born in the year 1975 in my dataset so nobody matches the first QA criterion and it is hard to tell if it is working without modifying the dummy data (which we could do).
What does 'patient only has year of death' refer to? As far as I can tell the code applies 'year of birth is after year of death if both are available' or 'year of birth is missing and year of death is available'. We don't want the latter logic - I think this should be 'year of birth is available and year of death is missing' (i.e., the patient is alive).
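One possible reading of the logic proposed in this comment, written with positive conditions, is sketched below. The column names qa_num_birth_year and qa_num_death_year and the dummy values are assumptions for illustration, and treating a missing year of birth as a failure is an interpretation that should be checked against the protocol.

```r
library(dplyr)

# Hypothetical dummy data; column names are assumptions.
input <- tibble(
  patient_id        = 1:4,
  qa_num_birth_year = c(1975, 1990, NA, 1960),
  qa_num_death_year = c(NA, 1985, 2001, 2020)
)

qa <- input %>%
  mutate(
    qa_birth_death = case_when(
      # Assumption: without a year of birth the check cannot pass.
      is.na(qa_num_birth_year)              ~ "fail: year of birth missing",
      # Year of birth available, year of death missing: the patient is alive.
      is.na(qa_num_death_year)              ~ "pass: alive",
      # Both available: year of birth must not be after year of death.
      qa_num_birth_year > qa_num_death_year ~ "fail: born after death",
      TRUE                                  ~ "pass: birth on or before death"
    )
  )

print(qa)
```

Because each branch names its outcome, running this on dummy data with a mix of birth and death years shows directly which rule each patient hits.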
I know this is taken from the mental health repo, but there are some double negatives here. I think the following logic is clearer:
A similar approach could be taken with the other QA criteria concerning year of birth and death below.
I would make this something like:
It is only on or before 1/1/1900 when study_dates$earliest_expec=1/1/1900.
Though on further reading of the protocol, I notice this criterion is listed as 'Remove individuals whose date of death is after today' and does not mention invalid death dates.
Decision from today's meeting (February 11, 2025): The repository will be revised to ensure consistency with the protocol regarding inclusion criteria and quality assurance.
I am in two minds about including | is.na(input$cov_cat_sex) because the people with missing sex might include, for example, men with pregnancy/birth codes. This applies to all sex-specific QA criteria.
Yes, individuals with missing sex should be included here. Only individuals with sex-inconsistent codes are excluded: pregnancy/birth codes or HRT or COCP meds for men, and prostate cancer codes for women. We will handle the exclusion of individuals with missing sex in fn-inex.R to ensure accurate counts. Excluding them here would lead to inconsistencies in the reported number of missing sex cases.
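A minimal sketch of that behaviour, assuming hypothetical flag columns (qa_bin_pregnancy, qa_bin_prostate_cancer), a "male"/"female" coding for cov_cat_sex, and made-up dummy data, might look like this. It is not the code in this PR.

```r
library(dplyr)

# Hypothetical dummy data; the qa_bin_* columns and sex coding are assumptions.
input <- tibble(
  patient_id             = 1:5,
  cov_cat_sex            = c("male", "female", NA, "male", "female"),
  qa_bin_pregnancy       = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  qa_bin_prostate_cancer = c(FALSE, FALSE, FALSE, TRUE, TRUE)
)

qa <- input %>%
  mutate(
    # Flag only explicit conflicts between recorded sex and sex-specific codes.
    # For missing sex the comparisons are NA or FALSE, which case_when() treats
    # as no match, so those rows fall through to FALSE and are kept.
    qa_sex_conflict = case_when(
      cov_cat_sex == "male"   & qa_bin_pregnancy       ~ TRUE,
      cov_cat_sex == "female" & qa_bin_prostate_cancer ~ TRUE,
      TRUE                                             ~ FALSE
    )
  )

kept <- filter(qa, !qa_sex_conflict)

# Missing sex survives the QA step, so it can be counted and excluded later
# (in this thread, that later step is said to live in fn-inex.R).
n_missing_sex <- sum(is.na(kept$cov_cat_sex))
```

Keeping the missing-sex rows at this point means the later inclusion step reports the full number of missing sex cases rather than a count already reduced by QA.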