diff --git a/.gitignore b/.gitignore
index a9b7dc7..7e23695 100644
--- a/.gitignore
+++ b/.gitignore
@@ -154,3 +154,6 @@ dmypy.json
cython_debug/
# End of https://www.toptal.com/developers/gitignore/api/python,jupyternotebooks
+
+#Mac files
+.DS_Store
diff --git a/docs/requirements.txt b/docs/requirements.txt
deleted file mode 100644
index f4895b9..0000000
--- a/docs/requirements.txt
+++ /dev/null
@@ -1,3 +0,0 @@
-sphinx==4.1.2
-sphinx-rtd-theme==0.5.2
-myst-parser==0.15.1
diff --git a/docs/source/ETL.md b/docs/source/ETL.md
deleted file mode 100644
index c7b9835..0000000
--- a/docs/source/ETL.md
+++ /dev/null
@@ -1,890 +0,0 @@
-# CDA Extraction, Transformation, and Load (ETL) Documentation
-## Introduction
-The goal of this document is to record in greater detail the ETL process that the CDA uses to create the aggregated data tables queried by the API layer. A brief overview of the process to generate an endpoint table is shown below in Figure 1. Data from two of the Data Commons (DCs), [GDC](https://portal.gdc.cancer.gov/) and [PDC](https://pdc.cancer.gov/pdc/), undergo a similar process: extraction using publicly available APIs, followed by transformation into a structure based on the CCDH model. [IDC](https://portal.imaging.datacommons.cancer.gov/) data is queried and transformed using a single BigQuery query. The results of this query are saved, merged with the transformed GDC and PDC data, and uploaded to BigQuery as a table that is queried by the CDA API.
-
-## Data extraction and release information
-To identify the current version and extraction date for each of the databases, you can run the following command:
-
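-```
-from cdapython import Q  # Q is the query entry point of the CDA Python library
-
-r = Q.sql("SELECT option_value FROM `gdc-bq-sample.integration.INFORMATION_SCHEMA.TABLE_OPTIONS` WHERE table_name = 'all_v3_0_Subjects'")
-# The option value is one string; split it on the literal \n sequences it contains
-strings = r[0]['option_value'].split('\\n')
-# Strip the embedded double quotes from each entry
-new_strings = [string.replace('\"', '') for string in strings]
-print(new_strings)
-```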
-
-This produces the following output:
-[GDC data version - v31.0, GDC extraction date - 03/17/2022, PDC data version - v2.7, PDC extraction date - 03/18/2022, IDC data version - v.4.0, IDC extraction date - 03/09/2022]
-
-## R3.0 ETL Achievements
-The achievements for R3.0 are outlined as follows:
-* Previous table format now called Subjects endpoint
- * Replaced all File entities with Files, a list of file ids associated with the entity in which the list is located, e.g.:
- * File -> Files
- * ResearchSubject.File -> ResearchSubject.Files
- * ResearchSubject.Specimen.File -> ResearchSubject.Specimen.Files
-* Files endpoint added:
- * Endpoint oriented around File information
- * Includes all information regarding the file's associated entities (Subject, ResearchSubject, and Specimen)
-* Updated PDC to data version 2.4 from data version 2.3
-
-## R3.0 ETL Process Overview
-
-Data from each DC (GDC, PDC, and IDC) is extracted and transformed independently. The case and file endpoints of GDC and PDC are queried via their publicly available APIs to create case and file endpoint extracted data files (GDC case data, GDC file data, PDC case data, and PDC file data). Each extracted data file undergoes a transformation and aggregation step before it is ready to be merged with the transformed and aggregated data from the other DCs. At this point, all data from GDC and PDC are in the harmonized data schema and are representative of the Subjects and Files endpoints in the CDA data schema. IDC Subjects and Files endpoint data files are created using a single BigQuery query per endpoint against IDC's available table. These data files do not require additional aggregation before being merged with GDC and PDC. All of the Subjects endpoint data files from each DC are then merged into a single Subjects endpoint file, and all of the Files endpoint data files are merged into a single Files endpoint file. These two files are then uploaded to BigQuery as two separate tables, one for all Subjects and one for all Files, which the CDA API can query. An overview of the entire process can be seen in Figure 1 and is described in more detail below.
-
-|  |
-|:---:|
-| **Figure 1** |
-
-### Current Flow of ETL
-
-The extraction and transformation processes for GDC and PDC data are very similar. Each can be broken into two sub-processes. The first extracts the data from the DC's cases and files endpoints and transforms it into the CCDH-inspired data format. The second merges the transformed data from both DCs into our Subjects and Files endpoint data formats.
-
-#### GDC/PDC Cases and Files Extraction and Transformation
-
-|  |
-|:---:|
-| **Figure 2** |
-
-The extraction process for each node and endpoint uses the publicly available APIs exposed by the nodes. All the information that is used within CDA Release 3 for GDC has been obtained from the _cases_ and _files_ endpoints. Information from PDC is pulled from the _cases_, _files_, _program_, and _general_ endpoints. The majority of fields come from the _cases_ endpoint. The _files_ endpoint is used to get the file information and to provide the link from files to associated specimens and cases. The resulting structure incorporates details about the case along with details about the files associated with that case and with the specimens found within that case. In GDC, there are files that only link to cases, but any file that is linked to a specimen is also linked to the case that the specimen belongs to. The data files created by the extraction process are written with one case/**ResearchSubject** or file/**File** per line. These extracted data files are then submitted to the transformation code. The code reads the extracted data files line by line, and transforms each line into the data structure expected in our BigQuery tables.
-
-Since the extracted data file and the output of the transform code are written as one case/**ResearchSubject** per line, whereas our data structure is on a **Subject** by **Subject** basis, further aggregation of the data is needed for the Subjects endpoint. Aggregation of **Subject** entities in the **File** endpoint data file is also required. The aggregation code searches for any entries in the transformed data which have identical ids (**Subject** level id) and aggregates those entries together. Currently, for the **Subject** endpoint, the demographic information is coalesced between cases, whereas the **ResearchSubject** and **File** records from different cases are appended. For the **File** endpoint, the demographic information is coalesced between cases for the **Subject** entities. The error logs for each individual DC are produced by examining the demographic data of two or more corresponding **Subject**/**ResearchSubject** records and logging any discrepancy.
-
-#### IDC Subjects and Files Extraction and Transformation
-
-|  |
-|:---:|
-| **Figure 3** |
-
-The extraction and transformation process of IDC data takes a more concise approach. One query for the Subjects endpoint is executed to extract all data from IDC and transform it into the CDA Subjects schema. A similar query is run for the creation of the Files endpoint data. Since IDC does not have demographic information, there is no need to do any logging of aggregation errors like that done in GDC and PDC.
-
-
-
-#### Merger of GDC, PDC, and IDC Data
-
-|  |
-|:---:|
-| **Figure 4** |
-
-The merging of data between GDC, PDC, and IDC is very similar to the aggregation step in the extraction and transformation sub-process for GDC and PDC. For the Subjects endpoint, the merge code searches the GDC, PDC, and IDC Subjects files for matching ids, coalesces the demographic information (GDC taking priority over PDC), and appends **ResearchSubject** and **File** records. An Inter-DC log consisting of discrepancies between GDC and PDC demographic information is created. For the Files endpoint, the merge code reads all of the **Subject** entity information from the newly merged Subjects endpoint file, and replaces all **Subject** entities within the Files endpoint with the information found in the merged Subjects endpoint file. The now merged Subjects and Files endpoint files are then uploaded to BigQuery as our Subjects and Files endpoint tables.
-
-## Appendix
-
-### GDC Extraction
-#### Cases Endpoint Data
-All the fields that are currently available through the CDA Subjects endpoint are pulled from the _cases_ endpoint. File information can also be obtained through the _cases_ endpoint (see the files record under [GDC documentation for case fields](https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/#case-fields)), but only as a record linked to the case entity. The GDC Extract step therefore also uses the _files_ endpoint to link files with specimens:
-
-
- Table 1. JSON on the left represents raw data that is pulled from the GDC API using _cases_ endpoint. On the right, we can see the final, processed JSON that includes files records under all specimen type entities. The complete list of fields that are used can be found here.
-
-GDC Extract w/out File/Specimen Link |
-GDC Extract with File/Specimen Link |
-
-
-
-
-{
- case_id: value,
- ...
- project: {...},
- demographic: {...},
- diagnoses: [...],
- samples: [
- {
- ...
- portions: [
- {
- ...
- slides: [...],
- analytes: [
- {
- ...
- aliquots: [...]
- }
- ]
- }
- ]
- }
- ],
- files: [...]
-}
-
- |
-
-
-{
- case_id: value,
- ...
- project: {...},
- demographic: {...},
- diagnoses: [...],
- samples: [
- {
- ...
- files: [...],
- portions: [
- {
- ...
- files: [...],
- slides: [
- {..., files: [...]},
- ...
- ],
- analytes: [
- {
- ...
- files: [...],
- aliquots: [
- {
- ...
- files: [...]
- }
- ]
- }
- ]
- }
- ]
- }
- ],
- files: [...]
-}
-
-
- |
-
-
-
-
-To be able to associate files and specimen entities, the `file_id`, `cases.samples.sample_id`, `cases.samples.portions.portion_id`, `cases.samples.portions.slides.slide_id`, `cases.samples.portions.analytes.analyte_id`, and `cases.samples.portions.analytes.aliquots.aliquot_id` fields from the _files_ endpoint were used (see [GDC documentation for file fields](https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/#file-fields)).
-
-#### Files Endpoint Data
-All the fields that are currently available through the CDA Files endpoint are pulled from the [_files_ endpoint](https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/#file-fields). For extraction, no extra information or joining of data from the _cases_ endpoint is necessary.
-
-### PDC Extraction
-
-To get the PDC data, six GraphQL queries were used:
-
-* filesMetadata
-* allPrograms
-* paginatedCaseDemographicsPerStudy
-* paginatedCaseDiagnosesPerStudy
-* paginatedCasesSamplesAliquots
-* biospecimenPerStudy
-
-First, the _files_ endpoint is queried to get all file information, as well as linkages from files to aliquots, samples, and cases. This information is saved in a _files_ cache file. Next, all case and specimen information is extracted using the remaining queries. During the extraction of the case information, file_id information is joined to the associated cases and specimens. After all case information has been gathered and saved, the _files_ cache file is read through, and information about the relevant cases, aliquots, and samples is added.
-
-#### Creating _Files_ Cache File
-The only query used is filesMetadata. This query creates the _files_ cache file that contains all of the _files_ endpoint information, including all information about files and the ids of associated cases, samples, and aliquots.
-
-#### Extract All _Cases_ Information
-
-The next query – allPrograms – is used to get all the available Programs and Studies. The extraction code loops over all Programs and Studies and performs several queries for each PDC study. Most queries are from the _cases_ endpoint and include paginatedCaseDemographicsPerStudy, paginatedCaseDiagnosesPerStudy, and paginatedCasesSamplesAliquots. They are used to gather the demographics, diagnoses, and specimen records for all cases within the study. biospecimenPerStudy is used solely to determine the taxon/species of the cases in the study. For each case and specimen in the study, file information is added from the _files_ cache file created by the filesMetadata query. Because case information is extracted per PDC study, and a case can appear in more than one PDC study, some case information is duplicated in the extracted file. The result for a single PDC case record, after adding file information, looks as follows:
-
-
-```
-{
- case_id: value,
- ...
- demographics: [...],
- diagnoses: [...],
- samples: [
- {
- ...
- files: [...],
- aliquots: [
- {
- ...
- files: [...]
- }
- ]
- }
- ],
- files: [...]
-}
-```
-The complete list of fields that are used can be found [here](./Schema.md).
-
-#### Add Case and Specimen Info to _Files_ Cache File
-After all _cases_ information has been extracted, and _file_ information added where necessary, case and specimen data are added to the _files_ cache file.
-
- Table 2. JSON on the left represents raw data that is pulled from the PDC API using the _files_ endpoint. On the right, we can see the final, processed JSON that includes cases, samples, and aliquots records under all file entities. The complete list of fields that are used can be found here.
-
-PDC _Files_ Extract w/out _Cases_ and _Specimen_ info |
-PDC _Files_ Extract with _Cases_ and _Specimen_ info |
-
-
-
-
-{
- file_id: value,
- ...
- aliquots: [
- {
- aliquot_id: value,
- sample_id: value,
- case_id: value
- },
- {
- aliquot_id: value,
- sample_id: value,
- case_id: value
- },
- ...
- ]
-}
-
- |
-
-
-{
- file_id: value,
- ...
- cases: [
- {
- case_id: value,
- ...
- samples: [
- {
- sample_id: value,
- ...
- aliquots: [
- {
- aliquot_id: value,
- ...
- },
- ...
- ]
- },
- ...
- ]
- },
- ...
- ]
-}
-
- |
-
-
-
-### GDC and PDC Transformation
-
-Transformation in this section can, for the most part, be broken into two steps. The first step applies both structural changes and simple field name changes to the extracted data files. It uses mapping files for the GDC _cases_/Subjects endpoint, GDC _files_/Files endpoint, PDC _cases_/Subjects endpoint, and PDC _files_/Files endpoint. The details of this process differ slightly for each DC and endpoint; however, the end result is the same. Each entry in the resulting Subjects file still corresponds to a case/**ResearchSubject**, but has a structure equivalent to the final schema, in which each entry will correspond to a Subject. In this file, a Subject may correspond to multiple entries, which must be aggregated together. Each entry in the resulting Files file corresponds to a file/**File**, so no aggregation is required at the top level, but aggregation is needed in the **Subject** entities for the same reason it is needed in the Subjects file.
-
-The second step aggregates Subjects together from the same DC. In the Subjects file, for all entries that belong to the same **Subject**, the **ResearchSubject** records are appended underneath the same **Subject** entity. After this step, the data from each DC is in a common data format and ready for merging.
-
-For this section, the DCs are similar enough that the differences can be shown with the aforementioned mapping from GDC/PDC fields to the common data format found [here](./Schema.md).
-
-
-##### Step 1: Transformation
-
-For GDC and PDC, we iterate over every entry from the extracted data files and make specific changes to that entry. For the Subjects file, these include creating a top **Subject** level of data, which corresponds to the **Subject** entity as defined by the CCDH model. From there, the specific case information is recorded in a **ResearchSubject** entity, and the fields are transformed to align with the CCDH model. At the end of this transformation step, each entry still represents a case, now known as a **ResearchSubject**, but has **Subject** level information. A simplified example of an entry is given below in Table 3.
-
-
- Table 3
-
-Case Entry 1 |
-Transformed Case Entry 1 |
-
-
-
-
-{
- submitter_id: S1
- case_id: C3
- demographics:
- {days_to_birth: 45}
- primary_disease_site: Brain
- files:
- {file_id: file_1.doc}
- {file_id: file_2.txt}
- samples:
- {sample_id: samp_1
- files:
- {file_id: file_2.txt}
- }
-}
-
- |
-
-
-{
- id: S1
- days_to_birth: 45
- ResearchSubject:
- {id: C3
- primary_disease_site: Brain
- Files:[file_1.doc, file_2.txt]
- Specimen:
- {id: samp_1,
- Files:[file_2.txt]
- }
- }
- Files: [file_1.doc, file_2.txt]
-}
-
- |
-
-
-
-Much like the Subjects file, the transformation of the Files file includes creating a top **File** level of data, which corresponds to the **File** entity as defined by the CCDH model. From there, the case information is used to create **Subject** entities as well as **ResearchSubject** entities, and the fields are transformed to align with the CCDH model. At the end of this transformation step, each entry within the **Subject** still represents a case, now known as a **ResearchSubject**, but has **Subject** level information (just like in the Subjects file transformation). A simplified example of an entry is given below in Table 4.
-
-
- Table 4
-
-File Entry 1 |
-Transformed File Entry 1 |
-
-
-
-
-{
- file_id: file_1.txt
- cases:
- [
- {case_id: case_1,
- submitter_id: S1,
- samples:[...]
- },
- {case_id: case_2,
- submitter_id: S1,
- samples:[...]
- }
- ]
-}
-
- |
-
-
-{
- id: file_1.txt
- Subject:
- [
- {id: S1},
- {id: S1}
- ]
- ResearchSubject:
- [
- {id: case_1},
- {id: case_2}
- ]
- Specimen:[...]
-}
-
- |
-
-
-
-##### Step 2: Aggregation
-
-At this point, for the transformed Subjects file, a list of Subjects and corresponding ResearchSubjects is made, and any **Subject** with multiple **ResearchSubject** records has its **ResearchSubject** and **File** records appended under a single entry for the **Subject**. Duplicates are removed from the **File** records and, in PDC, from the **ResearchSubject** records. A simplified example of this aggregation can be seen in Table 5 below.
-
-
- Table 5
-
-Transformed Subject Entry 1 |
-Transformed Subject Entry 2 |
-Aggregated |
-
-
-
-
-{
- id: S1
- days_to_birth: 45
- ResearchSubject:
- {id: C3
- primary_disease_site: Brain
- Files: [file_1.doc, file_2.txt]
- Specimen:
- {id: samp_1,
- Files: [file_2.txt]
- }
- }
- Files: [file_1.doc, file_2.txt]
-}
-
- |
-
-
-{
- id: S1
- days_to_birth: 45
- ResearchSubject:
- {id: C7
- primary_disease_site: Brain
- Files: [file_1.doc, file_5.txt]
- Specimen:
- {id: samp_4,
- Files: [file_5.txt]
- }
- }
- Files: [file_1.doc, file_5.txt]
-}
-
- |
-
-
-{
- id: S1
- days_to_birth: 45
- ResearchSubject:
- {id: C3
- primary_disease_site: Brain
- Files: [file_1.doc, file_2.txt]
- Specimen:
- {id: samp_1
- Files: [file_2.txt]
- }
- },
- {id: C7
- primary_disease_site: Brain
- Files: [file_1.doc, file_5.txt]
- Specimen:
- {id: samp_4
- Files: [file_5.txt]
- }
- }
- Files: [file_1.doc, file_2.txt, file_5.txt]
-}
-
- |
-
-
-
-Transformed Subject Entry 1 and 2 are aggregated in this example. Transformed Subject Entry 1 and 2 both correspond to the **Subject** with the id ‘S1’, but have different **ResearchSubject** records, and overlapping entries in their **Subject** level **File** records (file_1.doc is in both). The aggregated entry appended the **ResearchSubject** records together and appended the **File** records together while removing the duplicate entry.
-
-For the transformed Files file, a list of Subjects and ResearchSubjects is made, and any **Subject** or **ResearchSubject** with multiple records has them merged into a single entry for the **Subject** or **ResearchSubject**. A simplified example of this aggregation can be seen in Table 6 below.
-
-
- Table 6
-
-Transformed File Entry 1 |
-Aggregated |
-
-
-
-
-{
- id: File1
- Subject:
- [
- {id: S1},
- {id: S1},
- {id: S2}
- ],
- ResearchSubject:
- [
- {id: RS2},
- {id: RS2},
- {id: RS3}
- ],
- Specimen:
- [
- {id: samp_1},
- ]
-}
-
- |
-
-
-{
- id: File1
- Subject:
- [
- {id: S1},
- {id: S2}
- ],
- ResearchSubject:
- [
- {id: RS2},
- {id: RS3}
- ],
- Specimen:
- [
- {id: samp_1},
- ]
-}
-
- |
-
-
-
-### IDC Extraction and Transformation
-
-For Release 3, IDC extraction and transformation are executed together using a single query per endpoint. This is possible because IDC makes its data available on BigQuery, and because of BigQuery features such as temporary functions, array aggregation of structured data, and grouping data by particular fields (id, species, etc.). The queries currently used for the Subjects and Files endpoints are:
-```
-# Subjects Endpoint Query
-CREATE TEMP FUNCTION
- idc_species_mapping(x STRING)
- RETURNS STRING AS (CASE x
- WHEN 'Human' THEN 'Homo sapiens'
- WHEN 'Canine' THEN 'Canis familiaris'
- WHEN 'Mouse' THEN 'Mus musculus'
- ELSE
- ''
- END
- );
-CREATE TEMP FUNCTION
- idc_SUBSTR(x STRING)
- RETURNS STRING AS (SUBSTR(x, 15));
-CREATE TEMP FUNCTION
- idc_drs_uri(x STRING)
- RETURNS STRING AS (CONCAT("drs://dg.4DFC:", x));
-SELECT
- PatientID AS id,
- [STRUCT('IDC' AS system,
- PatientID AS value)] AS identifier,
- idc_species_mapping(tcia_species) AS species,
- STRING(NULL) AS sex,
- STRING(NULL) AS race,
- STRING(NULL) AS ethnicity,
- NULL AS days_to_birth,
- [collection_id] AS subject_associated_project,
- STRING(NULL) AS vital_status,
- NULL AS age_at_death,
- STRING(NULL) AS cause_of_death,
- ARRAY_AGG(crdc_instance_uuid) AS Files
-FROM
- `canceridc-data.idc_v4.dicom_pivot_v4`
-GROUP BY
- id,
- species,
- collection_id
-```
-```
-# Files Endpoint Query
-CREATE TEMP FUNCTION
- idc_species_mapping(x STRING)
- RETURNS STRING AS (CASE x
- WHEN 'Human' THEN 'Homo sapiens'
- WHEN 'Canine' THEN 'Canis familiaris'
- WHEN 'Mouse' THEN 'Mus musculus'
- ELSE
- ''
- END
- );
-CREATE TEMP FUNCTION
- idc_SUBSTR(x STRING)
- RETURNS STRING AS (SUBSTR(x, 15));
-CREATE TEMP FUNCTION
- idc_drs_uri(x STRING)
- RETURNS STRING AS (CONCAT("drs://dg.4DFC:", x));
-SELECT
- crdc_instance_uuid AS id,
- [STRUCT('IDC' AS system,
- crdc_instance_uuid AS value)] AS identifier,
- idc_SUBSTR(gcs_url) AS label,
- 'Imaging' AS data_category,
- STRING(NULL) AS data_type,
- 'DICOM' AS file_format,
- collection_id AS associated_project,
- idc_drs_uri(crdc_instance_uuid) AS drs_uri,
- NULL AS byte_size,
- STRING(NULL) AS checksum,
- 'Imaging' AS data_modality,
- Modality AS imaging_modality,
- STRING(NULL) AS dbgap_accession_number,
- [STRUCT(PatientID AS id,
- [STRUCT('IDC' AS system,
- PatientID AS value)] AS identifier,
- idc_species_mapping(tcia_species) AS species,
- STRING(NULL) AS sex,
- STRING(NULL) AS race,
- STRING(NULL) AS ethnicity,
- NULL AS days_to_birth,
- [collection_id] AS subject_associated_project,
- STRING(NULL) AS vital_status,
- NULL AS age_at_death,
- STRING(NULL) AS cause_of_death)] AS Subject
-FROM
- `canceridc-data.idc_v4.dicom_pivot_v4`
-GROUP BY
- id,
- gcs_url,
- Modality,
- collection_id,
- PatientID,
- tcia_species
-```
-
-#### Temp Functions
-
-The temporary functions created are `idc_species_mapping`, `idc_SUBSTR`, and `idc_drs_uri`. These functions are used to transform the IDC fields `tcia_species`, `gcs_url`, and `crdc_instance_uuid` into the CDA data schema fields `species`, `File.label`, and `File.drs_uri`. As more fields become available and more transformations become necessary, more temporary functions will be added.
-
-#### SELECT ‘x’ AS ‘y’
-
-Due to the nature of the query, all fields desired in the CDA schema must be specified. The query is built using a mapping file similar to the GDC and PDC mapping files. For a direct mapping, such as `PatientID` to `id`, `PatientID AS id` is sufficient. For fields that require some type of transformation, like `File.label` or `species`, the function is added to the query (e.g. `idc_species_mapping(tcia_species) AS species`). For integer type fields that have no mapping to IDC, `NULL` is mapped, and for string type fields, `STRING(NULL)` is mapped. Any field mapped to a string literal is populated with that literal (e.g. `'DICOM' AS file_format`). The IDC mapping file can be found [here](https://github.com/CancerDataAggregator/transform/blob/integration/IDC_mapping.yml).
-
-#### FROM and GROUP BY
-
-This statement selects which IDC table to query, as well as how to aggregate the data. In the Subjects query, the results are grouped by `id` (the **Subject** level `identifier`), and then by `species` and `collection_id`, since BigQuery requires every non-aggregated column in the SELECT list to appear in the GROUP BY clause.
-
-### Merge
-#### Subjects Endpoint Merger
-After the data from GDC, PDC, and IDC have been transformed into a common data format, merging the data together can begin. For the Subjects endpoint, **Subject** level info is coalesced, and any data from records in **ResearchSubject** are simply appended underneath the same **Subject**, and the **Subject** Files lists are appended together. A simplified example of the merge between GDC, PDC, and IDC data can be seen in Table 7.
-
-
- Table 7. Simplified example of a merger between GDC, PDC, and IDC
-
-GDC |
-PDC |
-IDC |
-Merged |
-
-
-
-
-{
- id: A
- days_to_birth: 23
- race: None
- sex: M
- ResearchSubject:
- {id: A1
- ...
- }
- {id: A2
- ...
- }
- Files: [file_G1.doc]
-}
-
- |
-
-
-{
- id: A
- days_to_birth: None
- race: Caucasian
- sex: F
- ResearchSubject:
- {id: B4
- ...
- }
- {id: B5
- ...
- }
- Files: [file_P1.doc]
-}
-
- |
-
-
-{
- id: A
- days_to_birth:
- race:
- sex:
- Files: [file_I1.doc]
-}
-
- |
-
-
-{
- id: A
- days_to_birth: 23
- race: Caucasian
- sex: M
- ResearchSubject:
- {id: A1
- ...
- }
- {id: A2
- ...
- }
- {id: B4
- ...
- }
- {id: B5
- ...
- }
- Files: [file_G1.doc, file_P1.doc, file_I1.doc]
-}
-
- |
-
-
-
-##### Subject level merge
-
-The fields at the **Subject** level are merged by coalescing data. The code looks for values in GDC, then PDC, and then IDC until a value is found, and the first value found is used in the merged data. If there is conflicting data, this is logged. In Table 7, the id under **Subject** must be the same for the records to be considered for merging. The example for days_to_birth shows that GDC has a value of 23, whereas PDC has none. Since GDC has a populated value, the value from GDC is stored as the value for days_to_birth. Similarly, for race, GDC has no recorded value but PDC has a value of ‘Caucasian’. Since GDC is empty and PDC has a value, the value from PDC is stored. The final example shows conflicting data between GDC and PDC. GDC records sex as ‘M’ whereas PDC records it as ‘F’. Due to the conflicting information, this instance is recorded in a log, but the GDC value of ‘M’ is used in the merged data.
-
-##### ResearchSubject level append
-
-Looking at Table 7, in the merged data, all of the records from **ResearchSubject** from GDC and PDC are appended. At this time there is no equivalent **ResearchSubject** entity available for IDC data. Currently no merging happens at this level.
-
-##### File level append
-
-Looking at Table 7, in the merged data, all of the records from **File** from GDC, PDC, and IDC are appended. Currently no merging happens at this level.
-
-For the Subjects endpoint, **Subject** level info is coalesced, and any data from records in **ResearchSubject** are simply appended underneath the same **Subject**, and the **Subject** Files lists are appended together. A simplified example of the merge between GDC, PDC, and IDC data can be seen in Table 7.
-
-
- Table 7. Simplified example of a merger between GDC, PDC, and IDC
-
-GDC |
-PDC |
-IDC |
-Merged |
-
-
-
-
-{
- id: A
- days_to_birth: 23
- race: None
- sex: M
- ResearchSubject:
- {id: A1
- ...
- }
- {id: A2
- ...
- }
- Files: [file_G1.doc]
-}
-
- |
-
-
-{
- id: A
- days_to_birth: None
- race: Caucasian
- sex: F
- ResearchSubject:
- {id: B4
- ...
- }
- {id: B5
- ...
- }
- Files: [file_P1.doc]
-}
-
- |
-
-
-{
- id: A
- days_to_birth:
- race:
- sex:
- Files: [file_I1.doc]
-}
-
- |
-
-
-{
- id: A
- days_to_birth: 23
- race: Caucasian
- sex: M
- ResearchSubject:
- {id: A1
- ...
- }
- {id: A2
- ...
- }
- {id: B4
- ...
- }
- {id: B5
- ...
- }
- Files: [file_G1.doc, file_P1.doc, file_I1.doc]
-}
-
- |
-
-
-
-#### Files Endpoint Merger
-
-Since no files are shared between any of the DCs, no top level **File** information can be merged. However, the **Subject** entities within the Files endpoint file may be incomplete, and there is no easy way to correct them using the Files endpoint file alone. For this reason, CDA merges the Subjects endpoint information first, and uses the records from the fully merged Subjects endpoint to overwrite the corresponding **Subject** entities found in the Files endpoint. A simplified example of this is shown in Table 8.
-
-
- Table 8. Simplified example of a merger between GDC, PDC, and IDC
-
-Relevant Merged Subject |
-Unmerged File Record |
-File w/ Merged Subjects |
-
-
-
-
-{
- id: A
- days_to_birth: 23
- race: Caucasian
- sex: M
- ResearchSubject:
- {id: A1
- ...
- }
- {id: A2
- ...
- }
- Files: [file_G1.doc]
-}
-
- |
-
-
-{
- id: F1
- Subject:
- [
- {id: A
- days_to_birth: None
- race: None
- sex: M
- }
- ]
- ResearchSubject:
- {id: B4
- ...
- }
- {id: B5
- ...
- }
-}
-
- |
-
-
-{
- id: F1
- Subject:
- [
- {id: A
- days_to_birth: 23
- race: Caucasian
- sex: M
- }
- ]
- ResearchSubject:
- [
- {id: B4
- ...
- },
- {id: B5
- ...
- },
- ]
-}
-
- |
-
-
-
-##### Subject level merge
-
-The fields at the **Subject** level are overwritten by the **Subject** information found in the fully merged Subjects endpoint. The record for the **Subject** entity with id = A is missing demographic information in the Files endpoint data. This can happen when another DC contains this missing information. As such, the **Subject** entity information is overwritten by that found in the fully merged Subjects endpoint.
diff --git a/docs/source/ETL_Figures/ETL_Fig1.png b/docs/source/ETL_Figures/ETL_Fig1.png
deleted file mode 100644
index 78e7a72..0000000
Binary files a/docs/source/ETL_Figures/ETL_Fig1.png and /dev/null differ
diff --git a/docs/source/ETL_Figures/ETL_Fig2.png b/docs/source/ETL_Figures/ETL_Fig2.png
deleted file mode 100644
index 25bc620..0000000
Binary files a/docs/source/ETL_Figures/ETL_Fig2.png and /dev/null differ
diff --git a/docs/source/ETL_Figures/ETL_Fig3.png b/docs/source/ETL_Figures/ETL_Fig3.png
deleted file mode 100644
index ec84de4..0000000
Binary files a/docs/source/ETL_Figures/ETL_Fig3.png and /dev/null differ
diff --git a/docs/source/ETL_Figures/ETL_Fig4.png b/docs/source/ETL_Figures/ETL_Fig4.png
deleted file mode 100644
index 365e4b2..0000000
Binary files a/docs/source/ETL_Figures/ETL_Fig4.png and /dev/null differ
diff --git a/docs/source/Installation.rst b/docs/source/Installation.rst
deleted file mode 100644
index 4bdee74..0000000
--- a/docs/source/Installation.rst
+++ /dev/null
@@ -1,53 +0,0 @@
-Installation
-============
-
-.. _installation:
-
-Currently there are two methods for installation (Docker & pip), and an additional method that requires no installation (binder). The instructions for all of these methods can be found below. If you want to just jump in and start using CDA-python, we suggest that you try the binder link. This link will take you to an example Jupyter notebook with some basic commands and example queries.
-
-
-Install the CDA Python library locally
---------------------------------------
-
-1. Download and install Docker. Click this `link <https://www.docker.com/products/docker-desktop>`_ or copy https://www.docker.com/products/docker-desktop to your browser.
-2. Open Terminal or PowerShell and navigate to the cda-python folder. Then run the docker command:
-
- - ``docker-compose up --build``
-3. Open a Browser to http://localhost:8888 and you are up and running.
-
-4. To stop the container, return to the terminal window (from step 2) and press **Control-C**.
-
-To delete the container from your machine, use this command in the cdapython project directory.
-
-- ``docker compose down``
-
-Pip install
-------------
-Alternatively, CDA Python can be installed using ``pip``. This requires Python version >= 3.7 on your system. To check your version at the command line, run ``python -V``. To update your version, download a newer release from https://www.python.org/downloads/. Additional Python installation help can be found `here `_ . Once you have a suitable Python version, you can install CDA Python using:
-
-.. code-block:: console
-
- pip install git+https://github.com/CancerDataAggregator/cda-python.git
-
-To install a specific version you can use this command:
-
-.. code-block:: console
-
- pip install git+https://github.com/CancerDataAggregator/cda-python.git@
-
-Currently available versions:
- - 2.0.0
- - 2.1.0
-
-.. note::
-
- We recommend the docker method because pip installation can be a bit more cumbersome, and will not be as closely monitored as the docker installation.
-
-
-Launch in MyBinder:
--------------------
-
-To try out the example notebook in `MyBinder.org `_ without having to install anything, just click on the logo below. This will launch a Jupyter Notebook instance with our example notebook ready to run.
-
-.. image:: https://img.shields.io/badge/launch-binder-579aca.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFkAAABZCAMAAABi1XidAAAB8lBMVEX///9XmsrmZYH1olJXmsr1olJXmsrmZYH1olJXmsr1olJXmsrmZYH1olL1olJXmsr1olJXmsrmZYH1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olJXmsrmZYH1olL1olL0nFf1olJXmsrmZYH1olJXmsq8dZb1olJXmsrmZYH1olJXmspXmspXmsr1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olLeaIVXmsrmZYH1olL1olL1olJXmsrmZYH1olLna31Xmsr1olJXmsr1olJXmsrmZYH1olLqoVr1olJXmsr1olJXmsrmZYH1olL1olKkfaPobXvviGabgadXmsqThKuofKHmZ4Dobnr1olJXmsr1olJXmspXmsr1olJXmsrfZ4TuhWn1olL1olJXmsqBi7X1olJXmspZmslbmMhbmsdemsVfl8ZgmsNim8Jpk8F0m7R4m7F5nLB6jbh7jbiDirOEibOGnKaMhq+PnaCVg6qWg6qegKaff6WhnpKofKGtnomxeZy3noG6dZi+n3vCcpPDcpPGn3bLb4/Mb47UbIrVa4rYoGjdaIbeaIXhoWHmZYHobXvpcHjqdHXreHLroVrsfG/uhGnuh2bwj2Hxk17yl1vzmljzm1j0nlX1olL3AJXWAAAAbXRSTlMAEBAQHx8gICAuLjAwMDw9PUBAQEpQUFBXV1hgYGBkcHBwcXl8gICAgoiIkJCQlJicnJ2goKCmqK+wsLC4usDAwMjP0NDQ1NbW3Nzg4ODi5+3v8PDw8/T09PX29vb39/f5+fr7+/z8/Pz9/v7+zczCxgAABC5JREFUeAHN1ul3k0UUBvCb1CTVpmpaitAGSLSpSuKCLWpbTKNJFGlcSMAFF63iUmRccNG6gLbuxkXU66JAUef/9LSpmXnyLr3T5AO/rzl5zj137p136BISy44fKJXuGN/d19PUfYeO67Znqtf2KH33Id1psXoFdW30sPZ1sMvs2D060AHqws4FHeJojLZqnw53cmfvg+XR8mC0OEjuxrXEkX5ydeVJLVIlV0e10PXk5k7dYeHu7Cj1j+49uKg7uLU61tGLw1lq27ugQYlclHC4bgv7VQ+TAyj5Zc/UjsPvs1sd5cWryWObtvWT2EPa4rtnWW3JkpjggEpbOsPr7F7EyNewtpBIslA7p43HCsnwooXTEc3UmPmCNn5lrqTJxy6nRmcavGZVt/3Da2pD5NHvsOHJCrdc1G2r3DITpU7yic7w/7Rxnjc0kt5GC4djiv2Sz3Fb2iEZg41/ddsFDoyuYrIkmFehz0HR2thPgQqMyQYb2OtB0WxsZ3BeG3+wpRb1vzl2UYBog8FfGhttFKjtAclnZYrRo9ryG9uG/FZQU4AEg8ZE9LjGMzTmqKXPLnlWVnIlQQTvxJf8ip7VgjZjyVPrjw1te5otM7RmP7xm+sK2Gv9I8Gi++BRbEkR9EBw8zRUcKxwp73xkaLiqQb+kGduJTNHG72zcW9LoJgqQxpP3/Tj//c3yB0tqzaml05/+orHLksVO+95kX7/7qgJvnjlrfr2Ggsyx0eoy9uPzN5SPd86aXggOsEKW2Prz7du3VID3/tzs/sSRs2w7ovVHKtjrX2pd7ZMlTxAYfBAL9jiDwfLkq55Tm7ifhMlTGPyCAs7RFRhn47JnlcB9RM5T97ASuZXIcVNuUDIndpDbdsfrqsOppeXl5Y+XVKdjFCTh+zGaVuj0d9zy05PPK3QzBamxdwtTCrzyg/2Rvf2EstUjordGwa/kx9mSJLr8mLLtCW8HHGJc2R5hS219IiF6PnTusOqcMl57gm0Z8kanKMAQg0qSyuZfn7z
ItsbGyO9QlnxY0eCuD1XL2ys/MsrQhltE7Ug0uFOzufJFE2PxBo/YAx8XPPdDwWN0MrDRYIZF0mSMKCNHgaIVFoBbNoLJ7tEQDKxGF0kcLQimojCZopv0OkNOyWCCg9XMVAi7ARJzQdM2QUh0gmBozjc3Skg6dSBRqDGYSUOu66Zg+I2fNZs/M3/f/Grl/XnyF1Gw3VKCez0PN5IUfFLqvgUN4C0qNqYs5YhPL+aVZYDE4IpUk57oSFnJm4FyCqqOE0jhY2SMyLFoo56zyo6becOS5UVDdj7Vih0zp+tcMhwRpBeLyqtIjlJKAIZSbI8SGSF3k0pA3mR5tHuwPFoa7N7reoq2bqCsAk1HqCu5uvI1n6JuRXI+S1Mco54YmYTwcn6Aeic+kssXi8XpXC4V3t7/ADuTNKaQJdScAAAAAElFTkSuQmCC
- :target: https://mybinder.org/v2/gh/CancerDataAggregator/cda-python/HEAD?filepath=/notebooks/example.ipynb
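The pip instructions in Installation.rst require Python 3.7 or newer. A minimal sketch of that check in Python is shown here, as an alternative to reading the output of ``python -V`` by eye. The helper name ``python_is_supported`` is hypothetical and is not part of cda-python.

```python
import sys

# Minimum interpreter version stated in the pip instructions.
MIN_PYTHON = (3, 7)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum.

    Hypothetical helper for illustration; not part of cda-python.
    """
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if python_is_supported():
        print("Python version OK")
    else:
        print("Python %d.%d or newer is required" % MIN_PYTHON)
```

Comparing only the major and minor components keeps the check independent of patch releases.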
diff --git a/docs/source/ReleaseNotes.md b/docs/source/ReleaseNotes.md
deleted file mode 100644
index ce3cfb5..0000000
--- a/docs/source/ReleaseNotes.md
+++ /dev/null
@@ -1,98 +0,0 @@
-# CDA Release 3.0
-
-## Release Notes
-
-Last updated: March 22nd, 2022
-
-### Introduction to CDA
-
-
-
-Release 3.0 of CDA searches across data from the Genomic Data Commons (GDC), the Proteomic Data Commons (PDC), and the Imaging Data Commons (IDC) to aggregate and return data to users via a single application programming interface (API). CDA leverages the work and data model that is concurrently being developed by the [Center for Cancer Data Harmonization](https://datascience.cancer.gov/data-commons/center-cancer-data-harmonization-ccdh) (CCDH). CCDH will provide a single data model that harmonizes syntax and semantics across the CRDC systems and services.
-
-The CCDH data model promises to be a specimen-centric model whereas current CRDC nodes tend to use a case-centric approach. The diagrams below depict the shift from the respective GDC and PDC entity models (provided by CCDH - Figure 1) towards a specimen-centric model (Figure 2).
-
-|  |
-|:---:|
-| **Figure 1**: The PDC and GDC data models are case centric. |
-
-
-
-|  |
-|:---:|
-| **Figure 2**: CCDH is moving towards a specimen centric model. |
-
-As the CCDH model develops, CDA leverages the harmonization work of the [CCDH model](https://cancerdhc.github.io/ccdhmodel/entities/) by extending the model only where necessary, such as adding key search fields, to support CDA functionality. CDA periodically synchronizes with CCDH to maintain consistency between the [data model](https://github.com/CancerDataAggregator/cda-data-model) implemented in CDA and the developing CCDH model. This data model is expressed as JSON Schema.
-
-
-|  |
-|:---:|
-| **Figure 3**: The data model is expressed as JSON schema. |
-
-In Figure 3, the entities rimmed in blue are not yet part of the CCDH model but are extensions to allow CDA to aggregate and deliver data as the CCDH model evolves. It may be helpful to think about your queries in terms of these entities (e.g. Specimen, Subject, Research Subject, Project, Diagnosis) and their attributes (e.g. derived_from_subject, ethnicity, reference_assembly).
-
-When the data is fully harmonized, finding data for a specific subject or a specific specimen across all nodes will be simplified. For Release 1, CDA was able to identify common subjects across GDC and PDC in cases where the PDC “case_submitter_id” is equal to the GDC “submitter_id”, essentially harmonizing this field. Further alignment was achieved by using lower case characters consistently. For Release 2, CDA similarly merged IDC with GDC and PDC data by matching IDC “PatientID” with GDC “submitter_id” and PDC “case_submitter_id”.
-
-To assist you with the transition to CCDH data model terminology, we have provided a field by field mapping of terms from the GDC, PDC[^1], and IDC data dictionaries to the implemented JSON schema. This information can be found in [CDA Schema Field Mapping](./Schema.md).
-
-For details on the extraction, transformation, load (ETL) process, please see [CDA ETL Process](./ETL.md).
-
-
-## Datasets & Fields
-
-* All datasets updated as follows
- * GDC: v31.0, 03/17/2022
- * PDC: v2.7, 03/18/2022
- * IDC: v4.0, 03/09/2022
-
-## Enhanced query functionality
-
-* Added Docker for enabling quickstart
-* Enhanced Q functionality to more closely mimic natural language
-* Support for long queries
-* Added asynchronous hooks, ability to do searches in parallel
-* Can write raw SQL queries
-* Can check status of BigQuery to see if the source data table is up in Q
-* Unique values and columns return faster than in Release 1
-
-
-## Metadata Changes
-
-* See [CDA Schema Field Mapping](./Schema.md)
-
-* Summary
- * Previous table format now called Subjects endpoint
- * Replaced all File entities with Files, a list of the file ids associated with the entity in which the list is located, e.g.:
- * File -> Files
- * ResearchSubject.File -> ResearchSubject.Files
- * ResearchSubject.Specimen.File -> ResearchSubject.Specimen.Files
- * Files endpoint added:
- * Endpoint oriented around File information
- * Includes all information regarding the file's associated entities (Subject, ResearchSubject, and Specimen)
- * Newly available fields:
- * None
- * Renamed fields (old -> new):
- * None
-
-
-## Bug fixes
-
-* Fixed a problem where unnested items appeared at the top level of the JSON response, resulting in duplicated elements
-* Fixed cda-service overwriting query columns that share the same name
-* Added support for asynchronous API query calls
-* Added a unique-values API endpoint for returning distinct values in the cda table
-* Added a new status endpoint that verifies that the cda_mvp tables are available. Please let the dev team know if additional information from this endpoint (e.g., total table rows, schema sizes) would be useful in the future.
-* Added the ability to query the development tables used for integrating the IDC data with the existing PDC and GDC data. If your Jupyter notebooks fail to execute, you will likely have to reload cda-python into your Python virtual environment.
-
-
-## Known bugs and issues
-
-* `unique_terms` results are returned unsorted
-* Tumor stages are not harmonized, so redundant terms exist and complicate queries
-* `days_to_birth` should be reformatted (values are currently negative) or accompanied by an example query
-* The Docker Jupyter notebook does not work if another notebook is already open on port 8888
-
-
-
-[^1]:
- Information pulled from the PDC API may contain embargoed data.
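The release notes describe identifying common subjects by matching GDC `submitter_id`, PDC `case_submitter_id`, and IDC `PatientID`. A rough sketch of that matching idea follows. It is not the CDA ETL code: the function name, the input shapes (lists of dicts), and the sample ids are hypothetical, and lowercasing the ids before comparison is an assumption of this sketch, suggested by the note about using lower case characters consistently.

```python
from collections import defaultdict


def merge_subjects(gdc_cases, pdc_cases, idc_files):
    """Group records from the three sources by a normalized subject id.

    Hypothetical illustration: returns {subject_id: [sources...]}.
    Lowercasing the id is an assumption of this sketch.
    """
    merged = defaultdict(set)
    for case in gdc_cases:
        merged[case["submitter_id"].lower()].add("GDC")
    for case in pdc_cases:
        merged[case["case_submitter_id"].lower()].add("PDC")
    for file_row in idc_files:
        merged[file_row["PatientID"].lower()].add("IDC")
    # Sort the source names so the output is deterministic.
    return {sid: sorted(sources) for sid, sources in merged.items()}


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    result = merge_subjects(
        [{"submitter_id": "SUBJ-001"}],
        [{"case_submitter_id": "subj-001"}, {"case_submitter_id": "SUBJ-002"}],
        [{"PatientID": "SUBJ-001"}],
    )
    print(result)
```

A subject found in more than one Data Commons ends up with several entries in its source list, which corresponds to the merged Subject record described above.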
diff --git a/docs/source/ReleaseNotesFigs/CCDH_Specimen-centric_Jun2020.png b/docs/source/ReleaseNotesFigs/CCDH_Specimen-centric_Jun2020.png
deleted file mode 100644
index 18c17d5..0000000
Binary files a/docs/source/ReleaseNotesFigs/CCDH_Specimen-centric_Jun2020.png and /dev/null differ
diff --git a/docs/source/ReleaseNotesFigs/CDA_MVP_Release_1.png b/docs/source/ReleaseNotesFigs/CDA_MVP_Release_1.png
deleted file mode 100644
index 8bc763a..0000000
Binary files a/docs/source/ReleaseNotesFigs/CDA_MVP_Release_1.png and /dev/null differ
diff --git a/docs/source/ReleaseNotesFigs/CancerDataAggregator_PMD_0.png b/docs/source/ReleaseNotesFigs/CancerDataAggregator_PMD_0.png
deleted file mode 100644
index 11c5d75..0000000
Binary files a/docs/source/ReleaseNotesFigs/CancerDataAggregator_PMD_0.png and /dev/null differ
diff --git a/docs/source/ReleaseNotesFigs/GDCPDCModels.png b/docs/source/ReleaseNotesFigs/GDCPDCModels.png
deleted file mode 100644
index da0c445..0000000
Binary files a/docs/source/ReleaseNotesFigs/GDCPDCModels.png and /dev/null differ
diff --git a/docs/source/Schema.md b/docs/source/Schema.md
deleted file mode 100644
index c337de9..0000000
--- a/docs/source/Schema.md
+++ /dev/null
@@ -1,361 +0,0 @@
-# Schema and Mapping
-## Subjects Endpoint
-### Full Schema
-
-The **Subject** entity is the outermost record in the Subjects endpoint. Within the **Subject** record are the fields for the **Subject** (demographic and other subject-specific information), a Files field which lists the ids of all files associated with the **Subject**, as well as all **ResearchSubject** records associated with that **Subject**. Each **ResearchSubject** record has fields associated with the **ResearchSubject**, as well as records for the **Diagnosis** and **Specimen** entities associated with that **ResearchSubject**, and so on.
-
-
-| [Subject](subject_S) | | | | |
-| ------- | ------------------------------------------- | ---------------------------- | ----------------------------- | -------------------------- |
-| | id | | | |
-| | identifier.system | | | |
-| | identifier.value | | | |
-| | species | | | |
-| | sex | | | |
-| | race | | | |
-| | ethnicity | | | |
-| | days\_to\_birth | | | |
-| | subject\_associated\_project | | | |
-| | vital\_status | | | |
-| | age\_at\_death | | | |
-| | cause\_of\_death | | | |
-| | Files | | | |
-| | **[ResearchSubject](researchsubject_S):** | | | |
-| | | id | | |
-| | | identifier.system | | |
-| | | identifier.value | | |
-| | | primary\_diagnosis\_condition| | |
-| | | primary\_diagnosis\_site | | |
-| | | member\_of\_research\_project| | |
-| | | **[Diagnosis](diagnosis_S):** | | |
-| | | | id | |
-| | | | identifier.system | |
-| | | | identifier.value | |
-| | | | primary\_diagnosis | |
-| | | | age\_at\_diagnosis | |
-| | | | morphology | |
-| | | | stage | |
-| | | | grade | |
-| | | | method\_of\_diagnosis | |
-| | | | **[Treatment](treatment_S):** | |
-| | | | | id |
-| | | | | identifier.system |
-| | | | | identifier.value |
-| | | | | treatment\_type |
-| | | | | treatment\_outcome |
-| | | | | days\_to\_treatment\_start |
-| | | | | days\_to\_treatment\_end |
-| | | | | therapeutic\_agent |
-| | | | | treatment\_anatomic\_site |
-| | | | | treatment\_effect |
-| | | | | treatment\_end\_reason |
-| | | | | number\_of\_cycles |
-| | | Files | | |
-| | | **[Specimen](specimen_S):** | | |
-| | | | id | |
-| | | | identifier.system | |
-| | | | identifier.value | |
-| | | | associated\_project | |
-| | | | age\_at\_collection | |
-| | | | primary\_disease\_type | |
-| | | | anatomical\_site | |
-| | | | source\_material\_type | |
-| | | | specimen\_type | |
-| | | | derived\_from\_specimen | |
-| | | | derived\_from\_subject | |
-| | | | Files | |
-
-### Mapping
-(subject_S)=
-#### Subject
-
-All GDC, PDC, and IDC field names use a 'dot' notation to specify the paths. The first word denotes the endpoint that the field is taken from, such as "file(s)" or "case(s)". The rest of the field name is as seen in GDC and PDC documentation for those endpoints. Since we extract all IDC data from one pivot table which is organized around files, we use "files" as the endpoint for all fields from IDC.
-
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | ---------------------------------------------------------------------- | ---------------------------------- | -------------------------------- |
-| id | case.submitter\_id | cases.case\_submitter\_id | files.PatientID |
-| identifier.system | GDC | PDC | IDC |
-| identifier.value | case.submitter\_id | cases.case\_submitter\_id | files.PatientID |
-| species | Homo sapiens | cases.taxon | files.tcia\_species |
-| sex | case.demographic.gender | cases.demographics.gender | No available mapping |
-| race | case.demographic.race | cases.demographics.race | No available mapping |
-| ethnicity | case.demographic.ethnicity | cases.demographics.ethnicity | No available mapping |
-| days\_to\_birth | case.demographic.days\_to\_birth | cases.demographics.days\_to\_birth | No available mapping |
-| subject\_associated\_project | Create array of all projects for all ResearchSubjects for this Subject | | files.collection\_id |
-| vital\_status | case.demographic.vital\_status | cases.demographic.vital\_status | No available mapping |
-| age\_at\_death | case.demographic.days\_to\_death | cases.demographic.days\_to\_death | No available mapping |
-| cause\_of\_death | case.demographic.cause\_of\_death | cases.demographic.cause\_of\_death | No available mapping |
-| Files | Create array of all associated File ids for this Subject | | |
-| |
-| [ResearchSubject](researchsubject_S) | | | No ResearchSubject entity in IDC |
-
-(researchsubject_S)=
-#### ResearchSubject
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | -------------------------- | ---------------------------- | -------------- |
-| id | cases.case\_id | cases.case\_id | Not Applicable |
-| identifier.system | GDC | PDC | |
-| identifier.value | cases.case\_id | cases.case\_id | |
-| primary\_diagnosis\_condition | cases.disease\_type | cases.disease\_type | |
-| primary\_diagnosis\_site | cases.primary\_site | cases.primary\_site | |
-| member\_of\_research\_project | cases.project.project\_id | cases.project\_submitter\_id | |
-| Files | Create array of all associated File ids for this ResearchSubject |
-| |
-| [Diagnosis](diagnosis_S) | | | |
-| [Specimen](specimen_S) | | | |
-
-(diagnosis_S)=
-#### Diagnosis
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | ------------------------------------ | ------------------------------------ | -------------- |
-| id | cases.diagnoses.diagnosis\_id | cases.diagnoses.diagnosis\_id | Not applicable |
-| identifier.system | GDC | PDC | |
-| identifier.value | cases.diagnoses.diagnosis\_id | cases.diagnoses.diagnosis\_id | |
-| primary\_diagnosis | cases.diagnoses.primary\_diagnosis | cases.diagnoses.primary\_diagnosis | |
-| age\_at\_diagnosis | cases.diagnoses.age\_at\_diagnosis | cases.diagnoses.age\_at\_diagnosis | |
-| morphology | cases.diagnoses.morphology | cases.diagnoses.morphology | |
-| grade | cases.diagnoses.tumor\_grade | cases.diagnoses.tumor\_grade | |
-| stage | cases.diagnoses.tumor\_stage | cases.diagnoses.tumor\_stage | |
-| method\_of\_diagnosis | cases.diagnosis.method\_of\_diagnosis| cases.diagnosis.method\_of\_diagnosis| |
-| | | | |
-| [Treatment](treatment_S) | | | |
-
-(treatment_S)=
-#### Treatment
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | --------------------------------------------------- | ----------------------------------------------------- | -------------- |
-| id | cases.diagnoses.treatments.treatment\_id | cases.diagnoses.treatments.treatment\_id | Not applicable |
-| identifier.system | GDC | PDC | |
-| identifier.value | cases.diagnoses.treatments.treatment\_id | cases.diagnoses.treatments.treatment\_id | |
-| treatment\_type | cases.diagnoses.treatments.treatment\_type | cases.diagnoses.treatments.treatment\_type | |
-| treatment\_outcome | cases.diagnoses.treatments.treatment\_outcome | cases.diagnoses.treatments.treatment\_outcome | |
-| days\_to\_treatment\_start | cases.diagnoses.treatments.days\_to\_treatment\_start| cases.diagnoses.treatments.days\_to\_treatment\_start | |
-| days\_to\_treatment\_end | cases.diagnoses.treatments.days\_to\_treatment\_end | cases.diagnoses.treatments.days\_to\_treatment\_end | |
-| therapeutic\_agent | cases.diagnoses.treatments.therapeutic\_agents | cases.diagnoses.treatments.therapeutic\_agents | |
-| treatment\_anatomic\_site | cases.diagnoses.treatments.treatment\_anatomic\_site| cases.diagnoses.treatments.treatment\_anatomic\_site | |
-| treatment\_effect | cases.diagnoses.treatments.treatment\_effect | cases.diagnoses.treatments.treatment\_effect | |
-| treatment\_end\_reason | cases.diagnoses.treatments.reason\_treatment\_ended | cases.diagnoses.treatments.reason\_treatment\_ended | |
-| number\_of\_cycles | cases.diagnoses.treatments.number\_of\_cycles | cases.diagnoses.treatments.number\_of\_cycles | |
-
-(specimen_S)=
-#### Specimen
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | ---------------------------------------------------- | ----------------------------------------- | -------------- |
-| id | cases.samples.sample\_id | cases.samples.sample\_id | Not applicable |
-| | cases.samples.portions.portion\_id | cases.samples.aliquots.aliquot\_id | |
-| | cases.samples.portions.slide\_id | | |
-| | cases.samples.portions.analytes.analyte\_id | | |
-| | cases.samples.portions.analytes.aliquots.aliquot\_id | | |
-| identifier.system | GDC | PDC | |
-| identifier.value | cases.samples.sample\_id | cases.samples.sample\_id | |
-| | cases.samples.portions.portion\_id | cases.samples.aliquots.aliquot\_id | |
-| | cases.samples.portions.slide\_id | | |
-| | cases.samples.portions.analytes.analyte\_id | | |
-| | cases.samples.portions.analytes.aliquots.aliquot\_id | | |
-| associated\_project | cases.project.project\_id | cases.project\_submitter\_id | |
-| age\_at\_collection | cases.demographic.days\_to\_birth | cases.demographics.days\_to\_birth | |
-| derived\_from\_subject | cases.submitter\_id | cases.case\_submitter\_id | |
-| primary\_disease\_type | cases.disease\_type | cases.disease\_type | |
-| anatomical\_site | cases.samples.biospecimen\_anatomic\_site | cases.samples.biospecimen\_anatomic\_site | |
-| source\_material\_type | cases.samples.sample\_type | cases.samples.sample\_type | |
-| specimen\_type | sample, portion, slide, analyte, aliquot | sample, aliquot | |
-| derived\_from\_specimen | cases.samples.sample\_id | cases.samples.sample\_id | |
-| | cases.samples.portions.portion\_id | | |
-| | cases.samples.portions.analytes.analyte\_id | | |
-| derived\_from\_subject | cases.submitter\_id | cases.case\_submitter\_id | |
-| Files | Create array of all associated File ids for this Specimen | | |
-
-## Files Endpoint
-### Full Schema
-
-The **File** entity is the outermost record in the Files endpoint. Within the **File** record are the fields for the **File** (file metadata), as well as all **Subject**, **ResearchSubject**, and **Specimen** records associated with that **File**. Each of these entities has its own associated fields.
-
-| [File](File) | | | | |
-| ------- | ------------------------------------------- | ---------------------------- | ----------------------------- | -------------------------- |
-| | id | | |
-| | identifier.system | | |
-| | identifier.value | | |
-| | label | | |
-| | data\_category | | |
-| | data\_type | | |
-| | file\_format | | |
-| | associated\_project | | |
-| | drs\_uri | | |
-| | byte\_size | | |
-| | checksum | | |
-| | data\_modality | | |
-| | imaging\_modality | | |
-| | dbgap\_accession\_number | |
-| | **[Subject](subject_F):** | |
-| | | id | | |
-| | | identifier.system | | |
-| | | identifier.value | | |
-| | | species | | |
-| | | sex | | |
-| | | race | | |
-| | | ethnicity | | |
-| | | days\_to\_birth | | |
-| | | subject\_associated\_project | | |
-| | | vital\_status | | |
-| | | age\_at\_death | | |
-| | | cause\_of\_death | | |
-| | **[ResearchSubject](researchsubject_F):** | | | |
-| | | id | | |
-| | | identifier.system | | |
-| | | identifier.value | | |
-| | | primary\_diagnosis\_condition| | |
-| | | primary\_diagnosis\_site | | |
-| | | member\_of\_research\_project| | |
-| | | **[Diagnosis](diagnosis_F):** | | |
-| | | | id | |
-| | | | identifier.system | |
-| | | | identifier.value | |
-| | | | primary\_diagnosis | |
-| | | | age\_at\_diagnosis | |
-| | | | morphology | |
-| | | | stage | |
-| | | | grade | |
-| | | | method\_of\_diagnosis | |
-| | | | **[Treatment](treatment_F):** | |
-| | | | | id |
-| | | | | identifier.system |
-| | | | | identifier.value |
-| | | | | treatment\_type |
-| | | | | treatment\_outcome |
-| | | | | days\_to\_treatment\_start |
-| | | | | days\_to\_treatment\_end |
-| | | | | therapeutic\_agent |
-| | | | | treatment\_anatomic\_site |
-| | | | | treatment\_effect |
-| | | | | treatment\_end\_reason |
-| | | | | number\_of\_cycles |
-| |**[Specimen](specimen_F):** | | | |
-| | | id | |
-| | | identifier.system | |
-| | | identifier.value | |
-| | | associated\_project | |
-| | | age\_at\_collection | |
-| | | primary\_disease\_type | |
-| | | anatomical\_site | |
-| | | source\_material\_type | |
-| | | specimen\_type | |
-| | | derived\_from\_specimen | |
-| | | derived\_from\_subject | |
-
-### Mapping
-(File)=
-#### File
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | ------------------------------------- | ------------------------------ | ------------------------------------- |
-| id | files.file\_id | files.file\_id | files.crdc\_instance\_uuid |
-| identifier.system | GDC | PDC | IDC |
-| identifier.value | files.file\_id | files.file\_id | files.crdc\_instance\_uuid |
-| label | files.file\_name | files.file\_name | files.gcs\_url |
-| data\_category | files.data\_category | files.data\_category | Imaging |
-| data\_type | files.data\_type | files.file\_type | No mapping available |
-| file\_format | files.data\_format | files.file\_format | DICOM |
-| associated\_project | cases.project.project\_id | cases.project\_submitter\_id | files.collection\_id |
-| drs\_uri | drs://dg.4DFC:{files.file\_id} | drs://dg.4DFC:{files.file\_id} | No DCF formatting currently available |
-| byte\_size | files.file\_size | files.file\_size | No mapping available |
-| checksum | files.md5sum | files.md5sum | No mapping available |
-| data\_modality | Genomic | Proteomic | Imaging |
-| imaging\_modality | N/A | N/A | files.Modality |
-| dbgap\_accession\_number | cases.project.dbgap\_accession\_number| files.dbgap\_control\_number | No mapping available |
-| |
-| [Subject](subject_F) |
-| [ResearchSubject](researchsubject_F)|
-| [Specimen](specimen_F) |
-
-(subject_F)=
-#### Subject
-
-All GDC, PDC, and IDC field names use a 'dot' notation to specify the paths. The first word denotes the endpoint that the field is taken from, such as "file(s)" or "case(s)". The rest of the field name is as seen in GDC and PDC documentation for those endpoints. Since we extract all IDC data from one pivot table which is organized around files, we use "files" as the endpoint for all fields from IDC.
-
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | ---------------------------------------------------------------------- | ---------------------------------- | -------------------------------- |
-| id | file.case.submitter\_id | files.cases.case\_submitter\_id | files.PatientID |
-| identifier.system | GDC | PDC | IDC |
-| identifier.value | file.case.submitter\_id | files.cases.case\_submitter\_id | files.PatientID |
-| species | Homo sapiens | cases.taxon | files.tcia\_species |
-| sex | file.case.demographic.gender | cases.demographics.gender | No available mapping |
-| race | file.case.demographic.race | cases.demographics.race | No available mapping |
-| ethnicity | file.case.demographic.ethnicity | cases.demographics.ethnicity | No available mapping |
-| days\_to\_birth | file.case.demographic.days\_to\_birth | cases.demographics.days\_to\_birth | No available mapping |
-| subject\_associated\_project | Create array of all projects for all ResearchSubjects for this Subject | | files.collection\_id |
-| vital\_status | file.case.demographic.vital\_status | cases.demographic.vital\_status | No available mapping |
-| age\_at\_death | file.case.demographic.days\_to\_death | cases.demographic.days\_to\_death | No available mapping |
-| cause\_of\_death | file.case.demographic.cause\_of\_death | cases.demographic.cause\_of\_death | No available mapping |
-
-(researchsubject_F)=
-#### ResearchSubject
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | -------------------------- | ---------------------------- | -------------- |
-| id | file.cases.case\_id | files.cases.case\_id | Not Applicable |
-| identifier.system | GDC | PDC | |
-| identifier.value | file.cases.case\_id | files.cases.case\_id | |
-| primary\_diagnosis\_condition | file.cases.disease\_type | cases.disease\_type | |
-| primary\_diagnosis\_site | file.cases.primary\_site | cases.primary\_site | |
-| member\_of\_research\_project | file.cases.project.project\_id | cases.project\_submitter\_id | |
-| |
-| [Diagnosis](diagnosis_F) | | | |
-
-(diagnosis_F)=
-#### Diagnosis
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | ------------------------------------ | ------------------------------------ | -------------- |
-| id | file.cases.diagnoses.diagnosis\_id | cases.diagnoses.diagnosis\_id | Not applicable |
-| identifier.system | GDC | PDC | |
-| identifier.value | file.cases.diagnoses.diagnosis\_id | cases.diagnoses.diagnosis\_id | |
-| primary\_diagnosis | file.cases.diagnoses.primary\_diagnosis | cases.diagnoses.primary\_diagnosis | |
-| age\_at\_diagnosis | file.cases.diagnoses.age\_at\_diagnosis | cases.diagnoses.age\_at\_diagnosis | |
-| morphology | file.cases.diagnoses.morphology | cases.diagnoses.morphology | |
-| grade | file.cases.diagnoses.tumor\_grade | cases.diagnoses.tumor\_grade | |
-| stage | file.cases.diagnoses.tumor\_stage | cases.diagnoses.tumor\_stage | |
-| method\_of\_diagnosis | file.cases.diagnosis.method\_of\_diagnosis| cases.diagnosis.method\_of\_diagnosis| |
-| |
-| [Treatment](treatment_F) | | | |
-
-(treatment_F)=
-#### Treatment
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | --------------------------------------------------- | ----------------------------------------------------- | -------------- |
-| id | file.cases.diagnoses.treatments.treatment\_id | cases.diagnoses.treatments.treatment\_id | Not applicable |
-| identifier.system | GDC | PDC | |
-| identifier.value | file.cases.diagnoses.treatments.treatment\_id | cases.diagnoses.treatments.treatment\_id | |
-| treatment\_type | file.cases.diagnoses.treatments.treatment\_type | cases.diagnoses.treatments.treatment\_type | |
-| treatment\_outcome | file.cases.diagnoses.treatments.treatment\_outcome | cases.diagnoses.treatments.treatment\_outcome | |
-| days\_to\_treatment\_start | file.cases.diagnoses.treatments.days\_to\_treatment\_start| cases.diagnoses.treatments.days\_to\_treatment\_start | |
-| days\_to\_treatment\_end | file.cases.diagnoses.treatments.days\_to\_treatment\_end | cases.diagnoses.treatments.days\_to\_treatment\_end | |
-| therapeutic\_agent | file.cases.diagnoses.treatments.therapeutic\_agents | cases.diagnoses.treatments.therapeutic\_agents | |
-| treatment\_anatomic\_site | file.cases.diagnoses.treatments.treatment\_anatomic\_site| cases.diagnoses.treatments.treatment\_anatomic\_site | |
-| treatment\_effect | file.cases.diagnoses.treatments.treatment\_effect | cases.diagnoses.treatments.treatment\_effect | |
-| treatment\_end\_reason | file.cases.diagnoses.treatments.reason\_treatment\_ended | cases.diagnoses.treatments.reason\_treatment\_ended | |
-| number\_of\_cycles | file.cases.diagnoses.treatments.number\_of\_cycles | cases.diagnoses.treatments.number\_of\_cycles | |
-
-(specimen_F)=
-#### Specimen
-| Common Data Format (present in CDA) | GDC field name | PDC field name | IDC field name |
-| ----------------------------------- | ---------------------------------------------------- | ----------------------------------------- | -------------- |
-| id | file.cases.samples.sample\_id | cases.samples.sample\_id | Not applicable |
-| | file.cases.samples.portions.portion\_id | cases.samples.aliquots.aliquot\_id | |
-| | file.cases.samples.portions.slide\_id | | |
-| | file.cases.samples.portions.analytes.analyte\_id | | |
-| | file.cases.samples.portions.analytes.aliquots.aliquot\_id | | |
-| identifier.system | GDC | PDC | |
-| identifier.value | file.cases.samples.sample\_id | cases.samples.sample\_id | |
-| | file.cases.samples.portions.portion\_id | cases.samples.aliquots.aliquot\_id | |
-| | file.cases.samples.portions.slide\_id | | |
-| | file.cases.samples.portions.analytes.analyte\_id | | |
-| | file.cases.samples.portions.analytes.aliquots.aliquot\_id | | |
-| associated\_project | file.cases.project.project\_id | cases.project\_submitter\_id | |
-| age\_at\_collection | file.cases.demographic.days\_to\_birth | cases.demographics.days\_to\_birth | |
-| derived\_from\_subject | file.cases.submitter\_id | cases.case\_submitter\_id | |
-| primary\_disease\_type | file.cases.disease\_type | cases.disease\_type | |
-| anatomical\_site | file.cases.samples.biospecimen\_anatomic\_site | cases.samples.biospecimen\_anatomic\_site | |
-| source\_material\_type | file.cases.samples.sample\_type | cases.samples.sample\_type | |
-| specimen\_type | sample, portion, slide, analyte, aliquot | sample, aliquot | |
-| derived\_from\_specimen | file.cases.samples.sample\_id | cases.samples.sample\_id | |
-| | file.cases.samples.portions.portion\_id | | |
-| | file.cases.samples.portions.analytes.analyte\_id | | |
-| derived\_from\_subject | file.cases.submitter\_id | cases.case\_submitter\_id | |
-
diff --git a/docs/source/Troubleshooting.rst b/docs/source/Troubleshooting.rst
deleted file mode 100644
index ea2f513..0000000
--- a/docs/source/Troubleshooting.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-Troubleshooting
-===============
-
-.. _troubleshooting:
-
diff --git a/docs/source/api.rst b/docs/source/api.rst
deleted file mode 100644
index ec94338..0000000
--- a/docs/source/api.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-API
-===
-
-.. autosummary::
- :toctree: generated
-
- lumache
diff --git a/docs/source/conf.py b/docs/source/conf.py
deleted file mode 100644
index 35149c1..0000000
--- a/docs/source/conf.py
+++ /dev/null
@@ -1,44 +0,0 @@
-# Configuration file for the Sphinx documentation builder.
-
-# -- Project information
-
-project = 'Cancer Data Aggregator'
-copyright = '2021, Cancer Data Aggregator'
-author = 'Graziella'
-
-release = '0.1'
-version = '0.1.0'
-
-# -- General configuration
-
-extensions = [
- 'sphinx.ext.duration',
- 'sphinx.ext.doctest',
- 'sphinx.ext.autodoc',
- 'sphinx.ext.autosummary',
- 'sphinx.ext.napoleon',
- 'sphinx.ext.intersphinx',
-# 'sphinx_togglebutton',
- 'myst_parser',
- 'sphinx.ext.imgconverter'
-]
-
-intersphinx_mapping = {
- 'python': ('https://docs.python.org/3/', None),
- 'sphinx': ('https://www.sphinx-doc.org/en/master/', None),
-}
-intersphinx_disabled_domains = ['std']
-
-templates_path = ['_templates']
-
-# -- Options for HTML output
-
-html_theme = 'sphinx_rtd_theme'
-html_static_path = ['_static']
-
-# -- Options for EPUB output
-epub_show_urls = 'footnote'
-
-# -- Options for docstring output
-napoleon_google_docstring = False
-napoleon_numpy_docstring = True
diff --git a/docs/source/contact_info.md b/docs/source/contact_info.md
deleted file mode 100644
index 47a1126..0000000
--- a/docs/source/contact_info.md
+++ /dev/null
@@ -1,59 +0,0 @@
-# Contact Information & Troubleshooting
-
-All feedback can be submitted via email at [cda-developers@broadinstitute.org](mailto:cda-developers@broadinstitute.org).
-
-## How to report a bug as a Tester
-
-One advantage testers have over other end users is the ability to contact the CDA team via Slack. We encourage our testers to use Slack to communicate with us about small bugs. However, if a bug cannot be resolved within an hour, we ask that you follow the protocol below and report the bug via GitHub.
-
-If you have a bug or question, you can create an issue in [github](https://github.com/CancerDataAggregator/cda-python/issues).
-When reporting bugs, we ask that you please use the following template:
-
-**Describe the bug**
-A clear and concise description of what the bug is.
-
-**To Reproduce**
-Steps to reproduce the behavior:
-1. Go to '...'
-2. Click on '....'
-3. Scroll down to '....'
-4. See error
-
-**Expected behavior**
-A clear and concise description of what you expected to happen.
-
-**Screenshots**
-If applicable, add screenshots to help explain your problem.
-
-**Environment** [Production, Integration, Branch_version]
-
-**Desktop (please complete the following information):**
- - OS: [e.g. iOS]
- - Browser [e.g. chrome, safari]
-
-**Additional context**
-Add any other context about the problem here.
-
-.. note::
- Please label your issue with the **bug** label in GitHub
-
-Here is an example bug report: https://github.com/CancerDataAggregator/cda-python/issues/78
-
-## How to request a new feature
-In the same manner as reporting bugs, we request that you create an issue in [github](https://github.com/CancerDataAggregator/cda-python/issues) and label it **enhancement**.
-
-When requesting a new feature or an improvement to an existing one, we ask that you please use the following template:
-
-**Is your feature request related to a problem? Please describe.**
-A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
-
-**Describe the solution you'd like**
-A clear and concise description of what you want to happen.
-
-**Describe alternatives you've considered**
-A clear and concise description of any alternative solutions or features you've considered.
-
-**Additional context**
-Add any other context or screenshots about the feature request here.
-
-Example feature request: https://github.com/CancerDataAggregator/cda-python/issues/83
diff --git a/docs/source/index.rst b/docs/source/index.rst
deleted file mode 100644
index 231aa16..0000000
--- a/docs/source/index.rst
+++ /dev/null
@@ -1,29 +0,0 @@
-Welcome to CDA's documentation!
-===================================
-
-Integrative cancer research currently is hampered by the fact that important datasets are stored in separate, non-interoperable repositories. The Cancer Data Aggregator (CDA) is being developed to allow researchers to aggregate diverse data types generated by NCI-funded programs, such as the Human Tumor Cell Atlas Network (HTAN) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Using the CDA and a harmonized data model developed by the Center for Cancer Data Harmonization (CCDH), users can discover, query, retrieve, and aggregate data according to a variety of search parameters, such as participant, sample, tissue, disease, or study.
-
-In addition to a query engine, the CDA will provide a central repository for basic clinical and biospecimen metadata to serve as a primary resource for the CRDC. It will contain both open- and controlled-access metadata in a structured format to support federation across multiple repositories. This central repository will be accessible via an Application Programming Interface (API) and includes mechanisms for receiving new data and updating existing data. The central repository will eliminate the need for each CRDC repository to store redundant copies of common clinical and biospecimen data. This will avoid discrepancies between individual CRDC repositories that could compromise the CDA’s ability to return accurate results. The CDA will facilitate interoperability within the cancer data ecosystem to make complex datasets available to the research community to perform integrative analysis.
-
-**cdapython** (/c-d-a python/) is a Python library that sits on top of the machine-generated `CDA Python Client `_ and is built to make querying the CDA more pleasant.
-It pulls data from various :doc:`sources <ETL>`
-and offers a *simple* and *intuitive* API.
-
-Check out the :doc:`usage` section for further information, including
-how to :ref:`installation`.
-
-.. note::
-
- This project is under active development.
-
-Contents
---------
-
-.. toctree::
-
- Installation
- usage
- ReleaseNotes.md
- ETL.md
- Schema.md
- contact_info.md
diff --git a/docs/source/limits.md b/docs/source/limits.md
deleted file mode 100644
index 2e9b985..0000000
--- a/docs/source/limits.md
+++ /dev/null
@@ -1,26 +0,0 @@
-# Data sources
-
-CDA currently collects data from 3 sources:
-- [GDC](https://portal.gdc.cancer.gov/)
-- [PDC](https://pdc.cancer.gov/pdc/)
-- [IDC](https://portal.imaging.datacommons.cancer.gov/)
-
-and because of this there may be limitations on the information for a given patient.
-
-
-## Data extraction and release information
-To identify the current version and release dates for each of the databases, you can run the following command:
-
-```
-r = Q.sql("SELECT option_value FROM `gdc-bq-sample.integration.INFORMATION_SCHEMA.TABLE_OPTIONS` WHERE table_name = 'all_v1'")
-strings = r[0]['option_value'].split('\\n')
-new_strings = []
-
-for string in strings:
- new_string = string.replace('\"', '')
- new_strings.append(new_string)
-print(new_strings)
-```
-
-Which will produce the following output:
-['GDC extraction date - 09/27/2021', 'GDC data release - v30.0', 'PDC extraction date - 09/27/2021', 'PDC data release - v2.1', 'IDC extraction date - 09/27/2021', 'IDC data release - Version 3.0']
diff --git a/docs/source/rtd_images/CCDH Specimen-centric Jun2020.png b/docs/source/rtd_images/CCDH Specimen-centric Jun2020.png
deleted file mode 100644
index 18c17d5..0000000
Binary files a/docs/source/rtd_images/CCDH Specimen-centric Jun2020.png and /dev/null differ
diff --git a/docs/source/rtd_images/CDA MVP Release 1.png b/docs/source/rtd_images/CDA MVP Release 1.png
deleted file mode 100644
index 8bc763a..0000000
Binary files a/docs/source/rtd_images/CDA MVP Release 1.png and /dev/null differ
diff --git a/docs/source/rtd_images/GDCPDCModels.png b/docs/source/rtd_images/GDCPDCModels.png
deleted file mode 100644
index da0c445..0000000
Binary files a/docs/source/rtd_images/GDCPDCModels.png and /dev/null differ
diff --git a/docs/source/rtd_images/github_label.png b/docs/source/rtd_images/github_label.png
deleted file mode 100644
index b83a57a..0000000
Binary files a/docs/source/rtd_images/github_label.png and /dev/null differ
diff --git a/docs/source/usage.rst b/docs/source/usage.rst
deleted file mode 100644
index 3f8ac1f..0000000
--- a/docs/source/usage.rst
+++ /dev/null
@@ -1,1139 +0,0 @@
-=====
-Usage
-=====
-
-
-We will now show you the basic structure of `CDA python` through the
-use of the most common commands:
-
-- ``columns()``: show all available columns in the table,
-- ``unique_terms()``: for a given column show all unique terms,
-- ``Q()``: executes a query on the public CDA server,
-- ``query()``: allows you to write long-form Q statements without chaining, and
-- ``Q.sql()``: allows you to enter SQL style queries.
-
-**To begin we will first load all of the library and its methods:**
-
->>> from cdapython import Q, columns, unique_terms, query
-
-
-columns()
----------
-``columns(version = 'all_v1', host = None, limit = 100, table = 'integration')``
-
-displays all of the fields that can be queried using the ``Q`` or ``query`` (e.g. ethnicity, tumor stage, disease type, etc.)
-
-**Parameters:**
- - version : str [Optional]
- - version allows you to select which version of the SQL (BigQuery) tables to run queries on; default = 'all_v1'
- - host : str [Optional]
- - host allows you to change the server on which your queries run; default = None (Broad Institute)
- - limit : int [Optional]
- - limit allows you to set the number of values that ``columns`` returns; default = 100
- - table : str [Optional]
- - table allows you to select which BigQuery table is being searched; default = 'integration'
-**Returns:**
- list
-**Example:**
-
->>> columns() # List column names eg:
-['id',
- 'identifier',
- 'identifier.system',
- 'identifier.value',
- 'species',
- 'sex',
- 'race',
- 'ethnicity',
- 'days_to_birth',
- 'subject_associated_project',
- 'vital_status',
- 'age_at_death',
- 'cause_of_death',
- 'File',
- 'File.id',
- 'File.identifier',
- 'File.identifier.system',
- 'File.identifier.value',
- 'File.label',
- 'File.data_category',
- 'File.data_type',
- 'File.file_format',
- 'File.associated_project',
- 'File.drs_uri',
- 'File.byte_size',
- 'File.checksum',
- 'File.data_modality',
- 'File.imaging_modality',
- 'File.dbgap_accession_number',
- 'ResearchSubject',
- 'ResearchSubject.id',
- 'ResearchSubject.identifier',
- 'ResearchSubject.identifier.system',
- 'ResearchSubject.identifier.value',
- 'ResearchSubject.member_of_research_project',
- 'ResearchSubject.primary_diagnosis_condition',
- 'ResearchSubject.primary_diagnosis_site',
- 'ResearchSubject.Diagnosis',
- 'ResearchSubject.Diagnosis.id',
- 'ResearchSubject.Diagnosis.identifier',
- 'ResearchSubject.Diagnosis.identifier.system',
- 'ResearchSubject.Diagnosis.identifier.value',
- 'ResearchSubject.Diagnosis.primary_diagnosis',
- 'ResearchSubject.Diagnosis.age_at_diagnosis',
- 'ResearchSubject.Diagnosis.morphology',
- 'ResearchSubject.Diagnosis.stage',
- 'ResearchSubject.Diagnosis.grade',
- 'ResearchSubject.Diagnosis.method_of_diagnosis',
- 'ResearchSubject.Diagnosis.Treatment',
- 'ResearchSubject.Diagnosis.Treatment.id',
- 'ResearchSubject.Diagnosis.Treatment.identifier',
- 'ResearchSubject.Diagnosis.Treatment.identifier.system',
- 'ResearchSubject.Diagnosis.Treatment.identifier.value',
- 'ResearchSubject.Diagnosis.Treatment.treatment_type',
- 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
- 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
- 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',
- 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',
- 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',
- 'ResearchSubject.Diagnosis.Treatment.treatment_effect',
- 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason',
- 'ResearchSubject.Diagnosis.Treatment.number_of_cycles',
- 'ResearchSubject.File',
- 'ResearchSubject.File.id',
- 'ResearchSubject.File.identifier',
- 'ResearchSubject.File.identifier.system',
- 'ResearchSubject.File.identifier.value',
- 'ResearchSubject.File.label',
- 'ResearchSubject.File.data_category',
- 'ResearchSubject.File.data_type',
- 'ResearchSubject.File.file_format',
- 'ResearchSubject.File.associated_project',
- 'ResearchSubject.File.drs_uri',
- 'ResearchSubject.File.byte_size',
- 'ResearchSubject.File.checksum',
- 'ResearchSubject.File.data_modality',
- 'ResearchSubject.File.imaging_modality',
- 'ResearchSubject.File.dbgap_accession_number',
- 'ResearchSubject.Specimen',
- 'ResearchSubject.Specimen.id',
- 'ResearchSubject.Specimen.identifier',
- 'ResearchSubject.Specimen.identifier.system',
- 'ResearchSubject.Specimen.identifier.value',
- 'ResearchSubject.Specimen.associated_project',
- 'ResearchSubject.Specimen.age_at_collection',
- 'ResearchSubject.Specimen.primary_disease_type',
- 'ResearchSubject.Specimen.anatomical_site',
- 'ResearchSubject.Specimen.source_material_type',
- 'ResearchSubject.Specimen.specimen_type',
- 'ResearchSubject.Specimen.derived_from_specimen',
- 'ResearchSubject.Specimen.derived_from_subject',
- 'ResearchSubject.Specimen.File',
- 'ResearchSubject.Specimen.File.id',
- 'ResearchSubject.Specimen.File.identifier',
- 'ResearchSubject.Specimen.File.identifier.system',
- 'ResearchSubject.Specimen.File.identifier.value',
- 'ResearchSubject.Specimen.File.label',
- 'ResearchSubject.Specimen.File.data_category',
- 'ResearchSubject.Specimen.File.data_type',
- 'ResearchSubject.Specimen.File.file_format']
-
-
-All of the above fields describe the highest entity in the data structure hierarchy, the ``Subject`` entity. The first thirteen fields represent ``Subject`` demographic information, while the ``ResearchSubject`` entity contains details that we are used to seeing within the nodes' ``Case`` record.
-
-One of the contributions of the CDA is aggregated ``ResearchSubject`` information. This means that all ``ResearchSubject`` records coming from the same subject are now gathered under the Subject entity. As we know, certain specimens are studied in multiple projects (being part of a single data node or multiple nodes) as different ``ResearchSubject`` entries. Those ``ResearchSubject`` entries are collected as a list under the ``ResearchSubject`` entity. One example of this is the patient record with ``id = TCGA-13-1409``, which contains two ``ResearchSubject`` entries, one from GDC and the other from PDC, and three ``Subject`` entries, with an additional entry for IDC.
-
-.. note::
-
- Note that the ``ResearchSubject`` entity is a list of records, as many other entities above are. **There are certain considerations that should be made when creating the queries by using the fields that come from lists, but more about that will follow in examples below**.
-
-The names in the list may look familiar to you, but they may have been renamed or restructured in the CDA. For more information about the field name mappings you can look into :doc:`ETL` . A more direct way to explore and understand the fields is to use the ``unique_terms()`` function:
-
-
-unique_terms()
---------------
-``unique_terms(col_name: str, system: str = '', limit: int = 100, host: Optional[str] = None, table: Optional[str] = None)``
-
-displays all non-numeric values that can be searched in a query for a given column.
-
-**Parameters:**
- - col_name : str
- - col_name is the value from ``columns()`` for which you would like a list of searchable terms (e.g. 'ResearchSubject.primary_diagnosis_site')
- - system : str [Optional]
- - system allows you to determine which data commons you would like to search (GDC, PDC, or IDC; see :ref:`limit`)
- - limit : int [Optional]
- - limit allows you to set the number of values that ``unique_terms`` returns; default = 100
- - host : str [Optional]
- - host allows you to change the server on which your queries run; default = None (Broad Institute)
- - table : str [Optional]
- - table allows you to select which BigQuery table is being searched; default = 'integration'
-**Returns:**
- list
-**Example:**
-
-
-
-For each searchable field there are set values that can be searched
-(excluding numeric fields). To determine these values the ``unique_terms()`` command is used. For example, if we were interested in searchable disease types at the ResearchSubject level we would type the following:
-
->>> unique_terms("ResearchSubject.primary_diagnosis_condition")
-[None,
- 'Acinar Cell Neoplasms',
- 'Adenomas and Adenocarcinomas',
- 'Adnexal and Skin Appendage Neoplasms',
- 'Basal Cell Neoplasms',
- 'Blood Vessel Tumors',
- 'Breast Invasive Carcinoma',
- 'Chromophobe Renal Cell Carcinoma',
- 'Chronic Myeloproliferative Disorders',
- 'Clear Cell Renal Cell Carcinoma',
- 'Colon Adenocarcinoma',
-...
-
-.. note::
- The results of ``unique_terms()`` may not be the same at different
- levels (Subject vs ResearchSubject vs Specimen), so
- ``unique_terms()`` must be searched at the same level on which you plan to run your query.
-
-Additionally, you can specify a particular data node by using the ``system`` argument. For more information on data nodes/data commons see :ref:`ETL`.
-
->>> unique_terms("ResearchSubject.Specimen.source_material_type", system="PDC")
-['Cell Lines',
- 'Normal',
- 'Normal Adjacent Tissue',
- 'Not Reported',
- 'Primary Tumor',
- 'Solid Tissue Normal',
- 'Tumor',
- 'Xenograft Tissue']
-
-.. warning::
- Some columns hold arrays or other complex values and do not have ``unique_terms``. Array columns contain multiple values; an example is ``File.identifier``, which is composed of ``system`` (the data commons the information comes from) and ``value`` (the ID of a given file).
-
- .. code-block:: json
-
- {'File': [{'label': '0012f466-075a-4d47-b1d7-e8e63e8b9c99.vep.vcf.gz',
- 'associated_project': ['TCGA-BRCA'],
- 'drs_uri': 'drs://dg.4DFC:0012f466-075a-4d47-b1d7-e8e63e8b9c99',
- **'identifier': [{'system': 'GDC', 'value': '0012f466-075a-4d47-b1d7-e8e63e8b9c99'}]**
- ...
-
- Below is the list of column values that are not supported by ``unique_terms``. Additionally, these columns should not be used in a query.
- - 'File',
- - 'File.identifier',
- - 'identifier',
- - 'ResearchSubject',
- - 'ResearchSubject.Diagnosis',
- - 'ResearchSubject.Diagnosis.Treatment',
- - 'ResearchSubject.Specimen',
- - 'ResearchSubject.Specimen.File',
- - 'ResearchSubject.Specimen.File.identifier',
- - 'ResearchSubject.Specimen.identifier',
- - 'ResearchSubject.identifier',
- - 'subject_associated_project',
- - 'ResearchSubject.Diagnosis.identifier',
- - 'ResearchSubject.Diagnosis.Treatment.identifier',
- - 'ResearchSubject.File',
- - 'ResearchSubject.File.identifier'
-
-Q()
-----
-``Q(query)``
-
-Q lang is the language used to query the CDA service directly.
-
-**Parameters:**
- - query : str
- - a query string containing a value from ``columns()`` with a comparison operator (=, !=, <, >) and a numeric/boolean/unique value from ``unique_terms``.
-**Returns:**
- cda-python Q data type
-
-``Q().run``
-
-run(offset = 0, limit = 100, version = 'all_v2_1', host = None, dry_run = False, table = 'gdc-bq-sample.integration', async_call = False)
-
-**Parameters:**
- - async_call : bool
- - async_call allows for
- - table : str
- - table allows you to select which BigQuery table is being searched; default = ‘gdc-bq-sample.integration’
- - version : str
- - version allows you to select which version of the BigQuery table is being searched; default = ‘all_v2_1’
- - offset : int [optional]
- - [description]. Defaults to 0.
- - limit : int [optional]
- - limit allows you to set the number of values that returns per page; default = 100
- - host : URL, [optional]
- - host allows you to change the server on which your queries run; default = None (Broad Institute)
- - dry_run : bool, [optional]
- - [description]. Defaults to False.
-
-**Returns:**
- cda-python Q data type
-
-Q Comparison operators
-++++++++++++++++++++++
-
-The following comparison operators can be used with the `Q` or `query` command:
-
-+----------+---------------------------------------------------+---------------+
-| operator |condition description |Q.sql required?|
-+==========+===================================================+===============+
-| = | equals | no |
-+----------+---------------------------------------------------+---------------+
-| != | does not equal | no |
-+----------+---------------------------------------------------+---------------+
-| < | is less than | no |
-+----------+---------------------------------------------------+---------------+
-| > | is greater than | no |
-+----------+---------------------------------------------------+---------------+
-| <= | is less than or equal to | no |
-+----------+---------------------------------------------------+---------------+
-| >= | is greater than or equal to | no |
-+----------+---------------------------------------------------+---------------+
-| like | similar to = but allows wildcards ('%', '_', etc) | yes |
-+----------+---------------------------------------------------+---------------+
-| in | compares to a set | yes |
-+----------+---------------------------------------------------+---------------+
-
-Additionally, more complex SQL can be used with the `Q.sql()`_ command.
-
-**Example:**
-
-.. note::
-
- Any given part of a query is expressed as a string of three parts separated by spaces. **Therefore, there must be a space on both sides of the comparison operator**. The first part of the query is interpreted as a **column name**, the second as a **comparison operator**, and the third part as a **value**. If the value is a string, it needs to be put in double quotes.
-
-Now, let's dive into the querying!
-
-We can start by getting the record for ``id = TCGA-13-1409`` that we mentioned earlier:
-
->>> q = Q('id = "TCGA-13-1409"') # note the double quotes for the string value
->>> r = q.run()
->>> print(r)
-Getting results from database
-Total execution time: 1304 ms
-QueryID: 243b307b-776b-4427-a8b3-eacb9a87b8d6
-Query: SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1 WHERE (all_v2_1.id = 'TCGA-13-1409')
-Offset: 0
-Count: 1
-Total Row Count: 1
-More pages: False
-
-We've discussed ``Q`` but not the ``.run()`` method; ``.run()`` must
-be called to actually process your query. After calling ``print()`` on
-the query result variable we see that we have a single Subject record as a result, which is what we expect.
-
-Let's take a look at the results:
-
-
->>> r[0]
-{'id': 'TCGA-13-1409',
- 'identifier': [{'system': 'GDC', 'value': 'TCGA-13-1409'},
- {'system': 'PDC', 'value': 'TCGA-13-1409'},
- {'system': 'IDC', 'value': 'TCGA-13-1409'}],
- 'species': 'Homo sapiens',
- 'sex': 'female',
- 'race': 'white',
- 'ethnicity': 'not hispanic or latino',
- 'days_to_birth': '-26836',
- 'subject_associated_project': ['TCGA-OV',
- 'CPTAC-TCGA',
- 'CPTAC-TCGA',
- 'tcga_ov'],
- 'vital_status': 'Dead',
- 'age_at_death': '1742',
- 'cause_of_death': None,
- 'File': [{'id': '6850305a-e067-49fa-b617-0a4f32928352',
- 'identifier': [{'system': 'GDC',
- 'value': '6850305a-e067-49fa-b617-0a4f32928352'}],
- 'label': '6850305a-e067-49fa-b617-0a4f32928352.vep.vcf.gz',
- 'data_category': 'Simple Nucleotide Variation',
- 'data_type': 'Annotated Somatic Mutation',
- 'file_format': 'VCF',
- 'associated_project': 'TCGA-OV',
- 'drs_uri': 'drs://dg.4DFC:6850305a-e067-49fa-b617-0a4f32928352',
- 'byte_size': '142504',
- 'checksum': '0905d8fe02dd007065629983be81dd72',
- 'data_modality': 'Genomic',
- 'imaging_modality': None,
- 'dbgap_accession_number': None},
- {'id': '14a0766c-6ca4-47bb-ac70-62133c30c1c5',
- 'identifier': [{'system': 'GDC',
- 'value': '14a0766c-6ca4-47bb-ac70-62133c30c1c5'}],
- 'label': 'OV.focal_score_by_genes.txt',
- 'data_category': 'Copy Number Variation',
- 'data_type': 'Gene Level Copy Number Scores',
- 'file_format': 'TXT',
- 'associated_project': 'TCGA-OV',
- 'drs_uri': 'drs://dg.4DFC:14a0766c-6ca4-47bb-ac70-62133c30c1c5',
- 'byte_size': '26280573',
- 'checksum': '22e40a89cdeebbc162896f1cdfe7e55e',
- 'data_modality': 'Genomic',
- 'imaging_modality': None,
- 'dbgap_accession_number': None},
- {'id': '2e6f24c1-f5a3-4da4-83bf-457436d4927e',
- 'identifier': [{'system': 'GDC',
- 'value': '2e6f24c1-f5a3-4da4-83bf-457436d4927e'}],
- 'label': '2e6f24c1-f5a3-4da4-83bf-457436d4927e.vcf',
- 'data_category': 'Simple Nucleotide Variation',
- 'data_type': 'Raw Simple Somatic Mutation',
- 'file_format': 'VCF',
- 'associated_project': 'TCGA-OV',
- 'drs_uri': 'drs://dg.4DFC:2e6f24c1-f5a3-4da4-83bf-457436d4927e',
- 'byte_size': '2679669',
- 'checksum': '4ec46657a26fd3bcc27ca8fa856a591a',
- 'data_modality': 'Genomic',
- 'imaging_modality': None,
- 'dbgap_accession_number': None},
- ...
- 'ResearchSubject': [{'id': '18e0e996-8f23-4f53-94a5-dde38b550863',
- 'identifier': [{'system': 'GDC',
- 'value': '18e0e996-8f23-4f53-94a5-dde38b550863'}],
- 'member_of_research_project': 'TCGA-OV',
- 'primary_diagnosis_condition': 'Cystic, Mucinous and Serous Neoplasms',
- 'primary_diagnosis_site': 'Ovary',
- 'Diagnosis': [{'id': '6b0f33e6-884d-5a93-8335-9f55569790a7',
- 'identifier': [{'system': 'GDC',
- 'value': '6b0f33e6-884d-5a93-8335-9f55569790a7'}],
- 'primary_diagnosis': 'Serous cystadenocarcinoma, NOS',
- 'age_at_diagnosis': '26836',
- 'morphology': '8441/3',
- 'stage': None,
- 'grade': 'not reported',
- 'method_of_diagnosis': None,
- 'Treatment': [{'id': '1140ff80-4d83-58f4-b151-0737143a0984',
- 'identifier': [{'system': 'GDC',
- 'value': '1140ff80-4d83-58f4-b151-0737143a0984'}],
- 'treatment_type': 'Pharmaceutical Therapy, NOS',
- 'treatment_outcome': None,
- 'days_to_treatment_start': None,
- 'days_to_treatment_end': None,
- 'therapeutic_agent': None,
- 'treatment_anatomic_site': None,
- 'treatment_effect': None,
- 'treatment_end_reason': None,
- 'number_of_cycles': None},
- {'id': 'c9c78335-6d3f-52a5-92a9-c41ccbd8d4d8',
- 'identifier': [{'system': 'GDC',
- 'value': 'c9c78335-6d3f-52a5-92a9-c41ccbd8d4d8'}],
- 'treatment_type': 'Radiation Therapy, NOS',
- 'treatment_outcome': None,
- 'days_to_treatment_start': None,
- 'days_to_treatment_end': None,
- 'therapeutic_agent': None,
- 'treatment_anatomic_site': None,
- 'treatment_effect': None,
- 'treatment_end_reason': None,
- 'number_of_cycles': None}]}],
- 'File': [{'id': '6850305a-e067-49fa-b617-0a4f32928352',
- 'identifier': [{'system': 'GDC',
- 'value': '6850305a-e067-49fa-b617-0a4f32928352'}],
- 'label': '6850305a-e067-49fa-b617-0a4f32928352.vep.vcf.gz',
- 'data_category': 'Simple Nucleotide Variation',
- 'data_type': 'Annotated Somatic Mutation',
- 'file_format': 'VCF',
- 'associated_project': 'TCGA-OV',
- 'drs_uri': 'drs://dg.4DFC:6850305a-e067-49fa-b617-0a4f32928352',
- 'byte_size': '142504',
- 'checksum': '0905d8fe02dd007065629983be81dd72',
- 'data_modality': 'Genomic',
- 'imaging_modality': None,
- 'dbgap_accession_number': None},
- ...
- 'Specimen': [{'id': '930c3552-f960-4a57-aa35-b504807a9676',
- 'identifier': [{'system': 'GDC',
- 'value': '930c3552-f960-4a57-aa35-b504807a9676'}],
- 'associated_project': 'TCGA-OV',
- 'age_at_collection': '-26836',
- 'primary_disease_type': 'Cystic, Mucinous and Serous Neoplasms',
- 'anatomical_site': None,
- 'source_material_type': 'Primary Tumor',
- 'specimen_type': 'sample',
- 'derived_from_specimen': 'initial specimen',
- 'derived_from_subject': 'TCGA-13-1409',
- 'File': [{'id': '04da990e-67a3-4ead-ab08-448c7118624c',
- 'identifier': [{'system': 'GDC',
- 'value': '04da990e-67a3-4ead-ab08-448c7118624c'}],
- 'label': 'TCGA.OV.varscan.04da990e-67a3-4ead-ab08-448c7118624c.DR-10.0.protected.maf.gz',
- 'data_category': 'Simple Nucleotide Variation',
- 'data_type': 'Aggregated Somatic Mutation',
- 'file_format': 'MAF',
- 'associated_project': 'TCGA-OV',
- 'drs_uri': 'drs://dg.4DFC:04da990e-67a3-4ead-ab08-448c7118624c',
- 'byte_size': '216647924',
- 'checksum': '431606691f638bb07d9028e6605539c7',
- 'data_modality': 'Genomic',
- 'imaging_modality': None,
- 'dbgap_accession_number': None},
- ...
-
-The record is pretty large, so we'll print out identifier values for each ``ResearchSubject`` to confirm that we have one ResearchSubject that comes from GDC, and one that comes from PDC:
-
->>> for research_subject in r[0]['ResearchSubject']:
-...     print(research_subject['identifier'])
-[{'system': 'GDC', 'value': '18e0e996-8f23-4f53-94a5-dde38b550863'}]
-[{'system': 'PDC', 'value': '3a36a497-63d7-11e8-bcf1-0a2705229b82'}]
-
-The values represent ResearchSubject IDs and are equivalent to case_id
-values in some data commons.
-
-.. warning::
- In some instances the results will span multiple pages; if this is the case, you must include ``next_page()`` in your loop. An example of looping with ``next_page()`` can be found here.
-
-Now that we can create a query with the ``Q()`` function, let's see how we can combine multiple conditions.
-
-And, Or and From operators
-++++++++++++++++++++++++++
-There are three operators available:
- * ``And()``
- * ``Or()``
- * ``From()``
-
-The following examples show how those operators work in practice.
-
-
-Example Query 1: And
-++++++++++++++++++++
-**Find data for subjects who were diagnosed after the age of 50 and who were investigated as part of the TCGA-OV project.**
-
-.. code-block:: python
-
- >>> q1 = Q('ResearchSubject.Diagnosis.age_at_diagnosis > 50*365')
- >>> q2 = Q('ResearchSubject.member_of_research_project = "TCGA-OV"')
-
- >>> q = q1.And(q2)
- >>> r = q.run()
-
- >>> print(r)
-
- Getting results from database
-
- Total execution time: 17860 ms
-
- QueryID: eb85cf0d-3edf-4310-9e52-de166ee58b7e
- Query: SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE ((_Diagnosis.age_at_diagnosis > 50*365) AND (_ResearchSubject.member_of_research_project = 'TCGA-OV'))
- Offset: 0
- Count: 100
- Total Row Count: 461
- More pages: True
-
-
-Example Query 2: And continued
-++++++++++++++++++++++++++++++
-**Find data for donors with melanoma (Nevi and Melanomas) diagnosis and who were diagnosed before the age of 30.**
-
-.. code-block:: python
-
- >>> q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
- >>> q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')
-
- >>> q = q1.And(q2)
- >>> r = q.run()
-
- >>> print(r)
-
- Getting results from database
-
- Total execution time: 11287 ms
-
- QueryID: 02c118d4-08ac-442f-bc79-71b794bab6bc
- Query: SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE ((_Specimen.primary_disease_type = 'Nevi and Melanomas') AND (_Diagnosis.age_at_diagnosis < 30*365))
- Offset: 0
- Count: 100
- Total Row Count: 663
- More pages: False
-
-
-In addition, we can check how many records come from particular systems by adding one more condition to the query:
-
-.. code-block:: python
-
- >>> q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
- >>> q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')
- >>> q3 = Q('ResearchSubject.Specimen.identifier.system = "GDC"')
-
- >>> q = q1.And(q2.And(q3))
- >>> r = q.run()
-
- >>> print(r)
-
-
- Getting results from database
-
- Total execution time: 9604 ms
-
- QueryID: 2cd1f165-f6f5-49e4-b699-b4df191a540f
- Query: SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis, UNNEST(_Specimen.identifier) AS _identifier WHERE ((_Specimen.primary_disease_type = 'Nevi and Melanomas') AND ((_Diagnosis.age_at_diagnosis < 30*365) AND (_identifier.system = 'GDC')))
- Offset: 0
- Count: 100
- Total Row Count: 663
- More pages: False
-
-
-By comparing the ``Total Row Count`` value of the two results (663 in both cases), we can see that all the Subjects returned in the initial query are coming from the GDC.
-
-To explore the results further, we can fetch the Subject JSON objects by iterating through the results:
-
-.. code-block:: python
-
- >>> projects = set()
-
- >>> for subject in r:
- ...     research_subjects = subject['ResearchSubject']
- ...     for rs in research_subjects:
- ...         projects.add(rs['member_of_research_project'])
-
- >>> print(projects)
- {'FM-AD', 'TCGA-SKCM'}
-
-
-The output shows the projects where Nevi and Melanomas cases appear.
-
-Example Query 3: Or
-+++++++++++++++++++
-
-**Identify all samples that meet the following conditions:**
-
-* **Sample is from primary tumor**
-* **Disease is ovarian or breast cancer**
-* **Subjects are females under the age of 60 years**
-
-.. code-block:: python
-
- >>> tumor_type = Q('ResearchSubject.Specimen.source_material_type = "Primary Tumor"')
- >>> disease1 = Q('ResearchSubject.primary_diagnosis_site = "Ovary"')
- >>> disease2 = Q('ResearchSubject.primary_diagnosis_site = "Breast"')
- >>> demographics1 = Q('sex = "female"')
- >>> demographics2 = Q('days_to_birth > -60*365') # note that days_to_birth is a negative value
-
- >>> q1 = tumor_type.And(demographics1.And(demographics2))
- >>> q2 = disease1.Or(disease2)
- >>> q = q1.And(q2)
-
- >>> r = q.run()
- >>> print(r)
-
- Getting results from database
-
- Total execution time: 20529 ms
-
- QueryID: 2b325482-f764-4675-aebe-43f7e8d4004a
- Query: SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen WHERE (((_Specimen.source_material_type = 'Primary Tumor') AND ((all_v2_1.sex = 'female') AND (all_v2_1.days_to_birth > -60*365))) AND ((_ResearchSubject.primary_diagnosis_site = 'Ovary') OR (_ResearchSubject.primary_diagnosis_site = 'Breast')))
- Offset: 0
- Count: 100
- Total Row Count: 28040
- More pages: True
-
-
-
-In this case, the result contains more records than the default page size of 100. To load the next 100 records, we can use the ``next_page()`` method:
-
-.. code-block:: python
-
- >>> r2 = r.next_page()
-
- >>> print(r2)
-
- QueryID: 92f1a560-5385-49d9-a477-286c16f7f67c
- Query: SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen WHERE (((_Specimen.source_material_type = 'Primary Tumor') AND ((all_v2_1.sex = 'female') AND (all_v2_1.days_to_birth > -60*365))) AND ((_ResearchSubject.primary_diagnosis_site = 'Ovary') OR (_ResearchSubject.primary_diagnosis_site = 'Breast')))
- Offset: 100
- Count: 100
- Total Row Count: 28040
- More pages: True
-
-
-Alternatively, we can use the ``offset`` argument to specify the record to start from:
-
-.. code-block:: python
-
- >>> # q is the query built in the previous example
- >>> r = q.run(offset=100)
- >>> print(r)
-
- Getting results from database
-
- Total execution time: 4278 ms
-
- QueryID: ee2150d8-11fb-4720-a0b3-0352f2d4a38f
- Query: SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen WHERE (((_Specimen.source_material_type = 'Primary Tumor') AND ((all_v2_1.sex = 'female') AND (all_v2_1.days_to_birth > -60*365))) AND ((_ResearchSubject.primary_diagnosis_site = 'Ovary') OR (_ResearchSubject.primary_diagnosis_site = 'Breast')))
- Offset: 100
- Count: 100
- Total Row Count: 28040
- More pages: True
-
-
-Example Query 4: From
-+++++++++++++++++++++
-
-**Find data for donors with "Ovarian Serous Cystadenocarcinoma" with proteomic and genomic data.**
-
-.. note::
- **Disease type values denoting the same disease groups can be completely different between different systems. This is where CDA features come into play.** We first start by exploring the values available for this particular field in both systems.
-
->>> unique_terms('ResearchSubject.primary_diagnosis_condition', system="GDC",limit=10)
-[None,
- 'Acinar Cell Neoplasms',
- 'Adenomas and Adenocarcinomas',
- 'Adnexal and Skin Appendage Neoplasms',
- 'Basal Cell Neoplasms',
- 'Blood Vessel Tumors',
- 'Chronic Myeloproliferative Disorders',
- 'Complex Epithelial Neoplasms',
- 'Complex Mixed and Stromal Neoplasms',
- 'Cystic, Mucinous and Serous Neoplasms']
-
-
-Since “Ovarian Serous Cystadenocarcinoma” doesn’t appear among the GDC values, let's take a look at the PDC:
-
->>> unique_terms('ResearchSubject.primary_diagnosis_condition', system="PDC")
-['Acute Myeloid Leukemia',
- 'Breast Invasive Carcinoma',
- 'Chromophobe Renal Cell Carcinoma',
- 'Clear Cell Renal Cell Carcinoma',
- 'Colon Adenocarcinoma',
- 'Early Onset Gastric Cancer',
- 'Glioblastoma',
- 'Head and Neck Squamous Cell Carcinoma',
- 'Hepatocellular Carcinoma ',
- 'Lung Adenocarcinoma',
- 'Lung Squamous Cell Carcinoma',
- 'Oral Squamous Cell Carcinoma',
- 'Other',
- 'Ovarian Serous Cystadenocarcinoma',
- 'Pancreatic Ductal Adenocarcinoma',
- 'Papillary Renal Cell Carcinoma',
- 'Pediatric/AYA Brain Tumors',
- 'Rectum Adenocarcinoma',
- 'Uterine Corpus Endometrial Carcinoma']
-
-The output shows that this term does appear in the PDC. So if we first identify the research subjects in the PDC that have this disease type, and then narrow the results down to the portion of the data that is also present in GDC, we get the records that we are looking for.
-
-.. code-block:: python
-
- >>> q1 = Q('ResearchSubject.primary_diagnosis_condition = "Ovarian Serous Cystadenocarcinoma"')
- >>> q2 = Q('ResearchSubject.identifier.system = "PDC"')
- >>> q3 = Q('ResearchSubject.identifier.system = "GDC"')
-
- >>> q = q3.From(q1.And(q2))
- >>> r = q.run()
-
- >>> print(r)
- Getting results from database
-
- Total execution time: 35006 ms
-
- QueryID: a2ce5a91-bca5-411e-ad51-b6039ced6d5e
- Query: SELECT all_v2_1.* FROM (SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE ((_ResearchSubject.primary_diagnosis_condition = 'Ovarian Serous Cystadenocarcinoma') AND (_identifier.system = 'PDC'))) AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE (_identifier.system = 'GDC')
- Offset: 0
- Count: 100
- Total Row Count: 275
- More pages: True
-
-As you can see, this is achieved with the ``From`` operator, which lets us build a query from the results of another
-query. This is particularly useful when a condition involves a single field that takes different values for
-different items of the list it belongs to, e.g. we need ``ResearchSubject.identifier.system`` to be both “PDC” and “GDC” for a
-single Subject. In such cases the ``And`` operator can’t help, because it only returns entries where one and the same item takes both values, i.e.,
-zero entries.
-
- >>> r = q1.run()
- >>> r = q1.run(limit=2) # Limit to two results per page
-
- >>> r.sql # Return SQL string used to generate the query e.g.
- "SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.primary_diagnosis_condition = 'Ovarian Serous Cystadenocarcinoma')"
-
- >>> print(r) # Prints some brief information about the result page eg:
- QueryID: 0d080ca0-1298-4da1-8654-593c92fad1f0
- Query: SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.primary_diagnosis_condition = 'Ovarian Serous Cystadenocarcinoma')
- Offset: 0
- Count: 2
- Total Row Count: 283
- More pages: True
-
- >>> r[0] # Returns nth result of this page as a Python dict e.g.
- {'id': 'TCGA-61-1724',
- 'identifier': [{'system': 'GDC', 'value': 'TCGA-61-1724'},
- {'system': 'PDC', 'value': 'TCGA-61-1724'}],
- 'species': 'Homo sapiens',
- 'sex': 'female',
- 'race': 'white',
- 'ethnicity': 'not hispanic or latino',
- 'days_to_birth': '-17168',
- 'subject_associated_project': ['TCGA-OV', 'CPTAC-TCGA', 'CPTAC-TCGA'],
- 'vital_status': 'Dead',
- 'age_at_death': '637',
- 'cause_of_death': None,
- 'File': [{'id': '14a0766c-6ca4-47bb-ac70-62133c30c1c5',
- 'identifier': [{'system': 'GDC',
- 'value': '14a0766c-6ca4-47bb-ac70-62133c30c1c5'}],
- 'label': 'OV.focal_score_by_genes.txt',
- 'data_category': 'Copy Number Variation',
- 'data_type': 'Gene Level Copy Number Scores',
- 'file_format': 'TXT',
- 'associated_project': 'TCGA-OV',
- 'drs_uri': 'drs://dg.4DFC:14a0766c-6ca4-47bb-ac70-62133c30c1c5',
- 'byte_size': '26280573',
- 'checksum': '22e40a89cdeebbc162896f1cdfe7e55e',
- 'data_modality': 'Genomic',
- 'imaging_modality': None,
- 'dbgap_accession_number': None},
- ...
-
- >>> r.pretty_print(0) # Prints the nth result nicely
- {
- "id": "TCGA-61-1724",
- "identifier": [
- {
- "system": "GDC",
- "value": "TCGA-61-1724"
- },
- {
- "system": "PDC",
- "value": "TCGA-61-1724"
- }
- ],
- "species": "Homo sapiens",
- "sex": "female",
- "race": "white",
- "ethnicity": "not hispanic or latino",
- "days_to_birth": "-17168",
- "subject_associated_project": [
- "TCGA-OV",
- "CPTAC-TCGA",
- "CPTAC-TCGA"
- ],
- "vital_status": "Dead",
- "age_at_death": "637",
- "cause_of_death": null,
- "File": [
- {
- "id": "14a0766c-6ca4-47bb-ac70-62133c30c1c5",
- "identifier": [
- {
- "system": "GDC",
- "value": "14a0766c-6ca4-47bb-ac70-62133c30c1c5"
- }
- ],
- "label": "OV.focal_score_by_genes.txt",
- "data_category": "Copy Number Variation",
- "data_type": "Gene Level Copy Number Scores",
- "file_format": "TXT",
- "associated_project": "TCGA-OV",
- "drs_uri": "drs://dg.4DFC:14a0766c-6ca4-47bb-ac70-62133c30c1c5",
- "byte_size": "26280573",
- "checksum": "22e40a89cdeebbc162896f1cdfe7e55e",
- "data_modality": "Genomic",
- "imaging_modality": null,
- "dbgap_accession_number": null
- },
- ...
-
- >>> r2 = r.next_page() # Fetches the next page of results
- >>> print(r2)
- QueryID: 0d080ca0-1298-4da1-8654-593c92fad1f0
- Query: SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.primary_diagnosis_condition = 'Ovarian Serous Cystadenocarcinoma')
- Offset: 2
- Count: 2
- Total Row Count: 283
- More pages: True
-
-
-Example Query 5: From continued (IDC)
-+++++++++++++++++++++++++++++++++++++
-
-**Find data for donors with "Ovarian Serous Cystadenocarcinoma" with proteomic and imaging data.**
-
-Let's repeat the previous query, but this time identify cases that are
-also in IDC. As noted before, the disease type value denoting the same disease groups can be completely different between different systems. So let's explore the values available for this particular field in IDC.
-
->>> unique_terms('ResearchSubject.primary_disease_type', system="IDC",limit=10)
-[]
-
-Oh no! Looks like we have an empty set. This is because IDC does not have `ResearchSubject` (or Specimen) entities, only Subject entities (see :ref:`here <ETL>` for more information). So, let's try the same code as `Example Query 4: From`_ but change ``ResearchSubject.identifier.system`` to **IDC** instead of **GDC**.
-
-.. code-block:: python
-
- q1 = Q('ResearchSubject.primary_diagnosis_condition = "Ovarian Serous Cystadenocarcinoma"')
-
- q2 = Q('ResearchSubject.identifier.system = "PDC"')
- q3 = Q('ResearchSubject.identifier.system = "IDC"')
-
- q = q3.From(q1.And(q2))
- r = q.run()
-
- print(r)
-
- Getting results from database
-
- Total execution time: 8746 ms
-
- QueryID: fc470d8d-a23d-4711-a79e-101226253108
- Query: SELECT all_v2_1.* FROM (SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE ((_ResearchSubject.primary_diagnosis_condition = 'Ovarian Serous Cystadenocarcinoma') AND (_identifier.system = 'PDC'))) AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE (_identifier.system = 'IDC')
- Offset: 0
- Count: 0
- Total Row Count: 0
- More pages: False
-
-
-Hmm, zero results. Looks like we made a similar mistake and once again included `ResearchSubject`. If we look at the available searchable fields again using ``columns()``, we will see that there is another field named ``identifier.system`` at the Subject level. So, let's try that:
-
-.. code-block:: python
-
- q1 = Q('ResearchSubject.primary_diagnosis_condition = "Ovarian Serous Cystadenocarcinoma"')
- q2 = Q('ResearchSubject.identifier.system = "PDC"')
- q3 = Q('identifier.system = "IDC"')
-
- q = q3.From(q1.And(q2))
- r = q.run()
-
- print(r)
-
- Getting results from database
-
- Total execution time: 17130 ms
-
- QueryID: 92c68759-8516-4b12-bbcd-4554495f4748
- Query: SELECT all_v2_1.* FROM (SELECT all_v2_1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE ((_ResearchSubject.primary_diagnosis_condition = 'Ovarian Serous Cystadenocarcinoma') AND (_identifier.system = 'PDC'))) AS all_v2_1, UNNEST(identifier) AS _identifier WHERE (_identifier.system = 'IDC')
- Offset: 0
- Count: 37
- Total Row Count: 37
- More pages: False
-
-
-After a quick fix we now have 37 cases.
-
-
-Example Query 6: Return all data
-++++++++++++++++++++++++++++++++
-
-In some instances you may want to return all of the data to build or process your own database. This can be done by querying for data from any of the Data Commons using the ``identifier.system`` column and the ``OR`` operator.
-
-.. code-block:: python
-
- q = query('identifier.system = "GDC" OR identifier.system = "PDC" OR identifier.system = "IDC"')
- r = q.run()
- r
-
- Getting results from database
-
- Total execution time: 25049 ms
-
- QueryID: 211bf374-62bd-477e-8bc6-5c7954eb587f
- Query: SELECT all_v1.* FROM gdc-bq-sample.integration.all_v1 AS all_v1, UNNEST(identifier) AS _identifier WHERE (((_identifier.system = 'GDC') OR (_identifier.system = 'PDC')) OR (_identifier.system = 'IDC'))
- Offset: 0
- Count: 100
- Total Row Count: 104731
- More pages: True
-
-query()
--------
-
-To ease the query writing process, we have also implemented ``query``
-which allows ``AND``, ``OR`` and ``FROM`` to be included in the query
-string without needing to use operators. The following `Q` query:
-
-.. code-block:: python
-
- >>> q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
- >>> q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')
- >>> q3 = Q('ResearchSubject.Specimen.identifier.system = "GDC"')
-
- >>> q = q1.And(q2.And(q3))
-
-can be rewritten using the `query` function:
-
->>> q = query('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas" AND ResearchSubject.Diagnosis.age_at_diagnosis < 30*365 AND ResearchSubject.Specimen.identifier.system = "GDC"')
->>> result = q.run()
-
-Q.sql()
--------
-
-In some cases more complex queries are required, and for that purpose
-we have implemented ``Q.sql()``, which accepts a SQL-style query:
-
-.. code-block:: python
-
- r1 = Q.sql("""
- SELECT
- *
- FROM gdc-bq-sample.cda_mvp.v1, UNNEST(ResearchSubject) AS _ResearchSubject
- WHERE (_ResearchSubject.primary_disease_type = 'Adenomas and Adenocarcinomas')
- """)
-
- >>> r1.pretty_print(0)
- { 'Diagnosis': [],
- 'ResearchSubject': [ { 'Diagnosis': [],
- 'Specimen': [],
- 'associated_project': 'CGCI-HTMCP-CC',
- 'id': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
- 'identifier': [ { 'system': 'GDC',
- 'value': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3'}],
- 'primary_disease_site': 'Cervix uteri',
- 'primary_disease_type': 'Adenomas and '
- 'Adenocarcinomas'}],
- 'Specimen': [],
- 'associated_project': 'CGCI-HTMCP-CC',
- 'days_to_birth': None,
- 'ethnicity': None,
- 'id': 'HTMCP-03-06-02177',
- 'id_1': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
- 'identifier': [ { 'system': 'GDC',
- 'value': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3'}],
- 'primary_disease_site': 'Cervix uteri',
- 'primary_disease_type': 'Adenomas and Adenocarcinomas',
- 'race': None,
- 'sex': None}
-
-Test queries
-------------
-
-Test query 1
-++++++++++++
-
-**Find data from all Subjects who have been treated with "Radiation Therapy, NOS" and have both genomic and proteomic data.**
-
-.. toggle-header::
- :header: Example 1 **Show/Hide Code**
-
-
- .. code-block:: python
-
- q1 = Q('ResearchSubject.Diagnosis.Treatment.treatment_type = "Radiation Therapy, NOS"')
- q2 = Q('ResearchSubject.identifier.system = "PDC"')
- q3 = Q('ResearchSubject.identifier.system = "GDC"')
-
- q = q2.From(q1.And(q3))
- r = q.run()
-
- print(r)
-
- Getting results from database
-
- Total execution time: 27414 ms
-
- QueryID: a8eabfc7-7258-45cb-8570-763ec4d1926c
- Query: SELECT all_v1.* FROM (SELECT all_v1.* FROM gdc-bq-sample.integration.all_v1 AS all_v1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis, UNNEST(_Diagnosis.Treatment) AS _Treatment, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE ((_Treatment.treatment_type = 'Radiation Therapy, NOS') AND (_identifier.system = 'GDC'))) AS all_v1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE (_identifier.system = 'PDC')
- Offset: 0
- Count: 100
- Total Row Count: 369
- More pages: True
-
-
-Test query 2
-++++++++++++
-
-**Find data from TCGA-BRCA project, with donors over the age of 50 with imaging data**
-
-.. code-block:: python
-
- q1 = Q('ResearchSubject.associated_project = "TCGA-BRCA"')
- q2 = Q('days_to_birth < -50*365')  # days_to_birth is negative, so "over 50" means less than -50*365
- q3 = Q('identifier.system = "IDC"')
-
- q = q3.From(q1.And(q2))
- r = q.run()
-
- print(r)
-
- Getting results from database
-
- Total execution time: 24125 ms
-
- QueryID: a5de2545-2b5e-476c-9e92-b768d058f603
- Query: SELECT all_v1.* FROM (SELECT all_v1.* FROM gdc-bq-sample.integration.all_v1 AS all_v1, UNNEST(ResearchSubject) AS _ResearchSubject WHERE ((_ResearchSubject.associated_project = 'TCGA-BRCA') AND (all_v1.days_to_birth < -50*365))) AS all_v1, UNNEST(identifier) AS _identifier WHERE (_identifier.system = 'IDC')
- Offset: 0
- Count: 88
- Total Row Count: 88
- More pages: False
-
-..
- Pointing to a custom CDA instance
- ----
-
- ``.run()`` will execute the query on the public `CDA API _.
-
- ``.run("http://localhost:8080")`` will execute the query on a CDA server running at
- ``http://localhost:8080``.
-
-Quick Explanation on UNNEST usage in BigQuery
----------------------------------------------
-
-When you use ``Q`` in the CDA client, the generated SQL statement is echoed back, and it may contain several `UNNEST` clauses
-whenever a field name uses a dotted (.) path.
-UNNEST is similar to an unwind: embedded data structures are flattened so that they can appear as rows in a table or an Excel file.
-Note: calling the SQL endpoint directly, as in the following examples, is not the preferred way to run a nested attribute query in BigQuery.
-The Q language DSL abstracts the unnesting required to
-query a Record. In BigQuery, nested structures must be spelled out with UNNEST
-syntax such that, for example,
-``A.B.C.D`` must be unwound in order to ``SELECT (_C.D)``, as follows:
-
-.. code-block:: sql
-
- SELECT (_C.D)
- FROM TABLE, UNNEST(A) AS _A, UNNEST(_A.B) AS _B, UNNEST(_B.C) AS _C
-
-
-``ResearchSubject.Specimen.source_material_type`` is a nested record that needs to be unwound with UNNEST before it can be queried properly in SQL.
-
-.. code-block:: sql
-
- SELECT DISTINCT(_Specimen.source_material_type)
- FROM gdc-bq-sample.cda_mvp.v3,
- UNNEST(ResearchSubject) AS _ResearchSubject,
- UNNEST(_ResearchSubject.Specimen) AS _Specimen
-
-
-Test query answers
-------------------
-
-Test query 1
-++++++++++++
-**Find data from all Subjects who have been treated with "Radiation Therapy, NOS" and have both genomic and proteomic data.**
-
-.. code-block:: python
-
- q1 = Q('ResearchSubject.Diagnosis.Treatment.treatment_type = "Radiation Therapy, NOS"')
- q2 = Q('ResearchSubject.identifier.system = "PDC"')
- q3 = Q('ResearchSubject.identifier.system = "GDC"')
-
- q = q2.From(q1.And(q3))
- r = q.run()
-
- print(r)
-
- Getting results from database
-
- Total execution time: 27414 ms
-
- QueryID: a8eabfc7-7258-45cb-8570-763ec4d1926c
- Query: SELECT all_v1.* FROM (SELECT all_v1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis, UNNEST(_Diagnosis.Treatment) AS _Treatment, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE ((_Treatment.treatment_type = 'Radiation Therapy, NOS') AND (_identifier.system = 'GDC'))) AS all_v1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE (_identifier.system = 'PDC')
- Offset: 0
- Count: 100
- Total Row Count: 369
- More pages: True
-
-
-Test query 2
-++++++++++++
-
-**Find data from TCGA-BRCA project, with donors over the age of 50 with imaging data**
-
-.. code-block:: python
-
- q1 = Q('ResearchSubject.associated_project = "TCGA-BRCA"')
- q2 = Q('days_to_birth < -50*365')  # days_to_birth is negative, so "over 50" means less than -50*365
- q3 = Q('identifier.system = "IDC"')
-
- q = q3.From(q1.And(q2))
- r = q.run()
-
- print(r)
-
- Getting results from database
-
- Total execution time: 24125 ms
-
- QueryID: a5de2545-2b5e-476c-9e92-b768d058f603
- Query: SELECT all_v1.* FROM (SELECT all_v1.* FROM gdc-bq-sample.integration.all_v2_1 AS all_v2_1, UNNEST(ResearchSubject) AS _ResearchSubject WHERE ((_ResearchSubject.associated_project = 'TCGA-BRCA') AND (all_v1.days_to_birth < -50*365))) AS all_v2_1, UNNEST(identifier) AS _identifier WHERE (_identifier.system = 'IDC')
- Offset: 0
- Count: 88
- Total Row Count: 88
- More pages: False