"""
by catie bullen
dec 2022
quick fix for DEV-1578 until long-term solution implemented
context: file ACLs are queried (from indexd and/or graph) and returned to DCF manifest
for data in released projects in the GDC Data Portal after each data release
this manifest is generated by dev team
some projects have only a 'project' level ACL,
some have both 'program' and 'project' level ACLs (i.e. "parent-child studies")
and some have only program level ACL
According to dbGaP, parent-child studies use the parent study for access
(i.e. program level)
This causes access issues for downstream DCF users
when a project has both program and project level ACLs
and the user has access to only one of them (the program level):
the user cannot access the data unless they have access to both ACLs
solution: We will perform a post-hoc edit to the dev team generated manifest
until the dev team has bandwidth to make changes to the DCF generation script
This post-hoc edit will essentially replace the ACL id list object
found in the 'acl' column of the manifest for the affected projects and files
Instead of listing both program and project level ACLs
for parent-child studies in the DCF manifest,
we will only list the program level ACL per the caveat that
"parent-child studies use the parent study (i.e. program level ACL) for access"
example: acl list object for a file in project TARGET-CCSK
would go from ["phs000466", "phs000218"] to ["phs000218"]
usage: python dcf_manifest_acl_fix.py <file>
"""
import json
import pandas as pd
import sys
# set up error log file and error stream
err_out = "./acl_fix_error_log.txt"
sys.stderr = open(err_out, "w")
# list of offender projects
# note these are only released projects in GDC, more projects exist that are not yet released to public
offenders = [
    "CGCI-BLGSP",
    "TARGET-CCSK",
    "CTSP-DLBCL1",
    "CGCI-HTMCP-CC",
    "TARGET-ALL-P2",
    "TARGET-ALL-P1",
    "TARGET-AML",
    "TARGET-WT",
    "TARGET-OS",
    "TARGET-RT",
    "TARGET-NBL",
]
# project-level ACLs (want to REMOVE these from the manifest list object under acl column)
# this is for reference
project_dict = {
    "CGCI-BLGSP": "phs000527",
    "TARGET-CCSK": "phs000466",
    "CTSP-DLBCL1": "phs001184",
    "CGCI-HTMCP-CC": "phs000528",
    "TARGET-ALL-P2": "phs000464",
    "TARGET-ALL-P1": "phs000463",
    "TARGET-AML": "phs000465",
    "TARGET-WT": "phs000471",
    "TARGET-OS": "phs000468",
    "TARGET-RT": "phs000470",
    "TARGET-NBL": "phs000467",
}
# program-level ACLs (want to KEEP/USE these in the manifest list object under acl column)
program_dict = {
    "CGCI-BLGSP": "phs000235",
    "TARGET-CCSK": "phs000218",
    "CTSP-DLBCL1": "phs001175",
    "CGCI-HTMCP-CC": "phs000235",
    "TARGET-ALL-P2": "phs000218",
    "TARGET-ALL-P1": "phs000218",
    "TARGET-AML": "phs000218",
    "TARGET-WT": "phs000218",
    "TARGET-OS": "phs000218",
    "TARGET-RT": "phs000218",
    "TARGET-NBL": "phs000218",
}
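# A minimal sketch, not part of the original fix: the offenders list and the
# two ACL dicts must name the same projects, otherwise the lookup into
# program_dict raises a KeyError for a missing project.
# check_same_projects is a hypothetical helper that reports any mismatch.
def check_same_projects(project_ids, *acl_dicts):
    """Return the sorted project ids missing from, or extra in, any dict."""
    expected = set(project_ids)
    mismatched = set()
    for acl_dict in acl_dicts:
        # symmetric difference: ids present on one side only
        mismatched |= expected ^ set(acl_dict)
    return sorted(mismatched)


# possible use: check_same_projects(offenders, project_dict, program_dict)
# returns an empty list when all three agree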
# function to apply to dataframe for correct acl
def switcheroo(project_id):
    """Return the acl list holding only the program-level ACL for project_id."""
    return [program_dict[project_id]]
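# A minimal sketch, not part of the original fix: the example in the module
# docstring shows the acl value with double quotes (["phs000218"]), which is
# what json.dumps produces, whereas a Python list written through to_csv
# comes out with single quotes (['phs000218']). format_acl_json is a
# hypothetical helper, and whether the DCF manifest expects the JSON form is
# an assumption to verify against a real manifest before using it.
def format_acl_json(acl_ids):
    """Serialize a list of ACL ids as a JSON array string."""
    import json  # local import keeps this sketch self-contained

    return json.dumps(list(acl_ids))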
# read in data frame
df = pd.read_csv(sys.argv[1], sep="\t")
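# separate records by whether or not they belong to problem projects
is_offender = df.project_id.isin(offenders)
non_offender_df = df[~is_offender]
# copy so the acl assignment below modifies an independent frame
# rather than a slice of df (avoids pandas SettingWithCopyWarning)
offender_df = df[is_offender].copy()
# apply function to problem records
offender_df["acl"] = offender_df["project_id"].apply(switcheroo)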
# rejoin data sets (offender rows end up after all other rows,
# so the original row order is not preserved)
df_v2 = pd.concat([non_offender_df, offender_df])
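# write back to file
# note: the prefix is added to the argument as given, so <file> should be
# a bare file name in the current directory rather than a path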
df_v2.to_csv("ACL_update_" + sys.argv[1], sep="\t", index=False)