
[GEN-1516, GEN-1555, GEN-1576] make table update cohort specific #158

Merged · 43 commits · Oct 31, 2024 · Changes from 32 commits
- `20fd5f9` make table update cohort specific and allow saving files to staging (danlu1, Oct 1, 2024)
- `80bcfec` utilize Staging projects and tables so no custom table creation scrip… (danlu1, Oct 2, 2024)
- `d27734a` remove unwanted line in config file (danlu1, Oct 2, 2024)
- `b7281cd` update docstrings (danlu1, Oct 2, 2024)
- `f049c3f` Update utilities.py (danlu1, Oct 2, 2024)
- `7eb0d36` Update utilities.py (danlu1, Oct 2, 2024)
- `936c578` reformat code (danlu1, Oct 2, 2024)
- `1057ac7` Merge branch 'GEN-1516-table_update_cohort_specific' of github.com:Sa… (danlu1, Oct 2, 2024)
- `993c066` remove changes in utilities (danlu1, Oct 3, 2024)
- `01398af` add missing parameter (danlu1, Oct 3, 2024)
- `b270a4b` add type hint and update docstring (danlu1, Oct 3, 2024)
- `a57182b` add new parameters to nextflow script (danlu1, Oct 3, 2024)
- `f40edfa` Update README.md (danlu1, Oct 3, 2024)
- `39fe512` add NA list to asDataFrame function in download_synapse_table (danlu1, Oct 8, 2024)
- `19afbf5` remove dry-run (danlu1, Oct 14, 2024)
- `8966b3b` update synapseclient version (danlu1, Oct 14, 2024)
- `705617b` add return type hint (danlu1, Oct 14, 2024)
- `57e796f` add test cases (danlu1, Oct 14, 2024)
- `ee37d5a` remove unused values (danlu1, Oct 14, 2024)
- `deca7e7` add cohort parameter to update_data_table (danlu1, Oct 18, 2024)
- `7825ebc` upgrade python version for synapseclient 4.6 (danlu1, Oct 18, 2024)
- `992f506` bump Python version to 3.11 (danlu1, Oct 23, 2024)
- `158296a` fix conflict to main.nf (danlu1, Oct 23, 2024)
- `2df2190` pulled changes from develop and add cohort param to update_data_updat… (danlu1, Oct 23, 2024)
- `9521330` add ddefault value description to download_synapse_table (danlu1, Oct 23, 2024)
- `e2651f4` bump checkout action version (danlu1, Oct 23, 2024)
- `70a3a0e` Update build-docker-images.yml (danlu1, Oct 24, 2024)
- `fe717c6` test to create docker image (danlu1, Oct 24, 2024)
- `e53df57` add table_update docker to nextflow pipeline (danlu1, Oct 24, 2024)
- `b89e608` re-organize and add test cases for download_synapse_table (danlu1, Oct 24, 2024)
- `7d2a000` add contributing instructions (danlu1, Oct 24, 2024)
- `f711e29` add arg names to download_synapse_table (danlu1, Oct 24, 2024)
- `19bcc84` add instructions for nextflow testing (danlu1, Oct 24, 2024)
- `8c20d90` Update CONTRIBUTING.md (danlu1, Oct 24, 2024)
- `a7ff3ab` Update CONTRIBUTING.md (danlu1, Oct 24, 2024)
- `3e11436` reformat code (danlu1, Oct 24, 2024)
- `f0e9cb8` Merge branch 'gen-1516-table_update_cohort_specific' of github.com:Sa… (danlu1, Oct 24, 2024)
- `c3aa727` add test cases (danlu1, Oct 25, 2024)
- `961d47c` add function to remove unwanted backslash in Ca Directed Drugs table (danlu1, Oct 28, 2024)
- `edd22ea` reformat tests/test_utilities.py (danlu1, Oct 28, 2024)
- `665b77a` Update CONTRIBUTING.md (danlu1, Oct 29, 2024)
- `7ed8bc1` make remove_backslash column specific (danlu1, Oct 30, 2024)
- `71b8ff7` Merge branch 'gen-1516-table_update_cohort_specific' of github.com:Sa… (danlu1, Oct 30, 2024)
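Two of the commits above (`961d47c`, `7ed8bc1`) add a column-specific helper that strips unwanted backslashes from the Ca Directed Drugs table. The actual implementation is not shown in this diff view; the following is a hypothetical sketch — the name `remove_backslash` comes from the commit messages, but the signature and behavior are assumptions:

```python
import pandas as pd


def remove_backslash(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Strip backslashes from the given string columns only (column specific).

    Hypothetical sketch; the real helper in scripts/table_updates may differ.
    """
    df = df.copy()
    for col in columns:
        # Only touch string values; leave NaN/None and non-strings untouched.
        df[col] = df[col].map(
            lambda v: v.replace("\\", "") if isinstance(v, str) else v
        )
    return df
```

Making the helper column specific (the change in `7ed8bc1`) avoids accidentally rewriting backslashes in columns where they are meaningful.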
2 changes: 1 addition & 1 deletion .github/workflows/build-docker-images.yml
@@ -23,7 +23,7 @@ jobs:

steps:
- name: Checkout repository
uses: actions/checkout@v3
uses: actions/checkout@v4
with:
fetch-depth: 2

23 changes: 23 additions & 0 deletions CONTRIBUTING.md
@@ -1,5 +1,28 @@
# genie-bpc-pipeline: Contributing Guidelines

## Getting started
1. [Clone the repository](https://help.github.com/articles/cloning-a-repository/) to your local machine so you can begin making changes.
2. On your local machine make sure you have the latest version of the `develop` branch:

```
git checkout develop
git pull origin develop
```
3. Create a feature branch off the `develop` branch and work on it. The branch should be named the same as the JIRA issue you are working on, in **lowercase** (e.g., `gen-1234-{feature-here}`). Make sure the branch name is as informative as possible.
```
git checkout develop
git checkout -b gen-1234-{feature-here}
```
4. Once you have made your additions or changes, make sure you write tests and run the [comparison scripts](https://github.com/Sage-Bionetworks/Genie_processing/blob/create_generalized_comparison_script/utility_scripts/compare_between_two_synapse_entities.py) to ensure changes are expected.
5. At this point you have only created the branch locally; you need to push it to your fork on GitHub.

```
git add <your-file>
git commit -m "your commit message"
git push --set-upstream origin gen-1234-{feature-here}
```
6. Create a pull request from the feature branch to the `develop` branch. A GitHub Action will be triggered to build a Docker image for the branch; you can check it [here](https://github.com/Sage-Bionetworks/genie-bpc-pipeline/pkgs/container/genie-bpc-pipeline).

## Nextflow Pipeline contribution

Here is how to contribute to the Nextflow workflow of the genie-bpc-pipeline.
9 changes: 5 additions & 4 deletions main.nf
@@ -93,15 +93,16 @@ workflow BPC_PIPELINE {
ch_comment = Channel.value(params.comment)

if (params.step == "update_potential_phi_fields_table") {
update_potential_phi_fields_table(ch_comment, params.production)
// validate_data.out.view()
} else if (params.step == "genie_bpc_pipeline"){
update_potential_phi_fields_table(ch_comment, params.production)// validate_data.out.view()
} else if (params.step == "update_data_table") {
update_data_table("default", ch_cohort, ch_comment, params.production)
} else if (params.step == "genie_bpc_pipeline"){
update_potential_phi_fields_table(ch_comment, params.production)
run_quac_upload_report_error(update_potential_phi_fields_table.out, ch_cohort)
run_quac_upload_report_warning(run_quac_upload_report_error.out, ch_cohort, params.production)
merge_and_uncode_rca_uploads(run_quac_upload_report_warning.out, ch_cohort, params.production)
// remove_patients_from_merged(merge_and_uncode_rca_uploads.out, ch_cohort, params.production)
update_data_table(merge_and_uncode_rca_uploads.out, ch_comment, params.production)
update_data_table(merge_and_uncode_rca_uploads.out, ch_cohort, ch_comment, params.production)
update_date_tracking_table(update_data_table.out, ch_cohort, ch_comment, params.production)
run_quac_table_report(update_date_tracking_table.out, ch_cohort, params.production)
run_quac_comparison_report(run_quac_table_report.out, ch_cohort, params.production)
10 changes: 5 additions & 5 deletions modules/update_data_table.nf
@@ -2,13 +2,14 @@
Update Synapse tables with merged and uncoded data.
*/
process update_data_table {
container "$params.table_updates_docker"

container 'sagebionetworks/genie-bpc-pipeline-table-updates'
secret 'SYNAPSE_AUTH_TOKEN'
debug true

input:
val previous
val cohort
val comment
val production

@@ -19,13 +20,12 @@ process update_data_table {
if (production) {
"""
cd /root/scripts/
python update_data_table.py -p /root/scripts/config.json -m "$comment" primary
python update_data_table.py -p /root/scripts/config.json -c $cohort -m "$comment" primary -pd
"""
}
else {
} else {
"""
cd /root/scripts/
python update_data_table.py -p /root/scripts/config.json -m "$comment" primary -d
python update_data_table.py -p /root/scripts/config.json -c $cohort -m "$comment" primary
"""
}
}
1 change: 1 addition & 0 deletions nextflow.config
@@ -53,6 +53,7 @@ profiles {
params {
// docker image parameters, see nextflow_schema.json for details
references_docker = "sagebionetworks/genie-bpc-pipeline-references"
table_updates_docker = "sagebionetworks/genie-bpc-pipeline-table-updates"
}
}
}
7 changes: 6 additions & 1 deletion nextflow_schema.json
@@ -50,13 +50,18 @@
"description": "Available BPC steps",
"enum": [
"update_potential_phi_fields_table",
"genie_bpc_pipeline"
"genie_bpc_pipeline",
"update_data_table"
]
},
"references_docker":{
"type": "string",
"description": "Name of docker to use in processes in scripts/references"
},
"table_updates_docker":{
"type": "string",
"description": "Name of docker to use in processes in scripts/table_updates"
},
"schema_ignore_params": {
"type": "string",
"description": "Put parameters to ignore for validation here separated by comma",
2 changes: 1 addition & 1 deletion scripts/table_updates/Dockerfile
@@ -1,4 +1,4 @@
FROM python:3.8
FROM python:3.11

WORKDIR /root/scripts

12 changes: 10 additions & 2 deletions scripts/table_updates/README.md
@@ -39,6 +39,14 @@ Usage

### Update the Synapse Tables with data
#### Primary Case Tables
python update_data_table.py -m [version_comment] primary
#### IRR Case Tables
##### 1. dry-run (Save output to local)
python update_data_table.py -c [cohort_name] -m [version_comment] primary -d

##### 2. production (Save output to production projects)
python update_data_table.py -c [cohort_name] -m [version_comment] primary -pd

##### 3. staging (Save output to staging projects)
python update_data_table.py -c [cohort_name] -m [version_comment] primary

#### IRR Case Tables (Deprecated)
python update_data_table.py -m [version_comment] irr
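The README commands above imply a command-line interface with `-p` (config), `-c` (cohort), and `-m` (comment) options, plus `-d` (dry-run) and `-pd` (production) switches, with staging as the default target. A minimal `argparse` sketch of such an interface — the flag names are taken from the commands shown, but the actual parser in `update_data_table.py` may differ:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the README."""
    parser = argparse.ArgumentParser(description="Update Synapse tables")
    parser.add_argument("-p", "--config", help="Path to config.json")
    parser.add_argument("-c", "--cohort", help="Cohort to update")
    parser.add_argument("-m", "--message", help="Version comment for the update")
    parser.add_argument("-d", "--dry_run", action="store_true",
                        help="Save output to local files")
    parser.add_argument("-pd", "--production", action="store_true",
                        help="Save output to production projects "
                             "(staging is the default)")
    parser.add_argument("table_type", choices=["primary", "irr"],
                        help="Which set of case tables to update")
    return parser
```

For example, `-c NSCLC -m "monthly update" primary -pd` (cohort and comment values are illustrative) would select the production path, while omitting both `-d` and `-pd` writes to staging.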
2 changes: 1 addition & 1 deletion scripts/table_updates/requirements.txt
@@ -1 +1 @@
synapseclient[pandas] == 2.7.2
synapseclient[pandas] == 4.6.0
67 changes: 67 additions & 0 deletions scripts/table_updates/tests/test_utilities.py
@@ -0,0 +1,67 @@
from unittest.mock import MagicMock, create_autospec, patch

import numpy as np
import pandas as pd
import pytest
import synapseclient
from synapseclient import Schema, Table
from table_updates import utilities


@pytest.fixture(scope="session")
def syn():
return create_autospec(synapseclient.Synapse)

@pytest.fixture(scope="session")
def table_schema():
schema = synapseclient.table.Schema(
name="test_table",
parent="syn123",
column_names=["col1", "col2"],
column_types=["STRING", "INTEGER"],
)
return schema


@pytest.mark.parametrize(
"query_return_df,select,query,expected_df",
[
(pd.DataFrame({'col1': ['value1', 'value2']}), "col1", "SELECT col1 from syn123456",pd.DataFrame({'col1': ['value1', 'value2']})),
(pd.DataFrame({'col1': ['value1', 'value2'],'col2': [1, 2]}), "col1,col2", "SELECT col1,col2 from syn123456",pd.DataFrame({'col1': ['value1', 'value2'],'col2': [1, 2]})),
(pd.DataFrame({'col1': ["NA", "value1", "None"],'col2': [1, 2, 3]}),"*","SELECT * from syn123456",pd.DataFrame({'col1': [np.nan, "value1", "None"],'col2': [1, 2, 3]})),
(pd.DataFrame(columns = ["col1", "col2"]),"*","SELECT * from syn123456",pd.DataFrame(columns = ["col1", "col2"])),
],
ids = ["selected_single_column","selected_multiple_column","pull_table_with_na_values_all_columns","pull_empty_table_all_columns"],
)
def test_download_synapse_table_default_condition(syn, table_schema, query_return_df, select, query, expected_df):
syn.tableQuery = MagicMock(return_value = Table(table_schema, query_return_df))
result = utilities.download_synapse_table(syn, "syn123456", select)

# validate
syn.tableQuery.assert_called_once_with(query)
pd.testing.assert_frame_equal(result, expected_df)

@pytest.mark.parametrize(
"query_return_df,condition,query,expected_df",
[
(pd.DataFrame({'col1': ['value1'],'col2': [1]}), "col1 = 'value1'", "SELECT * from syn123456 WHERE col1 = 'value1'",pd.DataFrame({'col1': ['value1'],'col2': [1]})),
(pd.DataFrame({'col1': ["NA", "value1", "None"],'col2': [1, 1, 1]}), "col2 = 1","SELECT * from syn123456 WHERE col2 = 1",pd.DataFrame({'col1': [np.nan, "value1", "None"],'col2': [1, 1, 1]})),
],
ids = ["selected_row_all_columns","pull_table_with_na_values_all_columns"],
)
def test_download_synapse_table_with_condition(syn, table_schema, query_return_df, condition, query,expected_df):
syn.tableQuery = MagicMock(return_value = Table(table_schema, query_return_df))
result = utilities.download_synapse_table(syn, "syn123456", condition = condition)

# validate
syn.tableQuery.assert_called_once_with(query)
pd.testing.assert_frame_equal(result, expected_df)

def test_download_empty_synapse_table_with_condition(syn, table_schema):
syn.tableQuery = MagicMock(return_value = Table(table_schema, pd.DataFrame(columns = ["col1", "col2"])))
result = utilities.download_synapse_table(syn, "syn123456", condition = "col2 = 1")

# validate
syn.tableQuery.assert_called_once_with("SELECT * from syn123456 WHERE col2 = 1")
pd.testing.assert_frame_equal(result, pd.DataFrame(columns = ["col1", "col2"]))
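The tests above pin down the expected behavior of `utilities.download_synapse_table`: build a `SELECT {select} from {table_id}` query, append a `WHERE` clause only when a condition is given, and convert literal `"NA"` cells to NaN while leaving strings like `"None"` intact. A minimal sketch consistent with those assertions — the real function lives in `scripts/table_updates/utilities.py` and may differ (for example, in the exact NA list passed to `asDataFrame`):

```python
from typing import Any

import pandas as pd


def download_synapse_table(syn: Any, table_id: str, select: str = "*",
                           condition: str = "") -> pd.DataFrame:
    """Query a Synapse table and return the result as a DataFrame.

    Sketch based on the test expectations; `syn` is a synapseclient.Synapse
    instance (or a mock exposing a compatible tableQuery interface).
    """
    query = f"SELECT {select} from {table_id}"
    if condition:
        query += f" WHERE {condition}"
    results = syn.tableQuery(query)
    # Treat literal "NA" cells as missing, but keep strings such as "None"
    # (assumes asDataFrame accepts pandas-style NA keyword arguments, as
    # suggested by commit 39fe512).
    return results.asDataFrame(na_values=["NA"], keep_default_na=False)
```

Defaulting `select` to `*` and `condition` to the empty string matches the two parametrized test groups: the first exercises column selection with no `WHERE` clause, the second fixes `select` at `*` and varies only the condition.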
