+ Coming soon: users will be able to suggest new vocabulary terms to the GDC via this link.
+
+
+
diff --git a/docs/Data_Dictionary/images/CDE_Data_Element_Details.png b/docs/Data_Dictionary/images/CDE_Data_Element_Details.png
new file mode 100644
index 000000000..c562a45ad
Binary files /dev/null and b/docs/Data_Dictionary/images/CDE_Data_Element_Details.png differ
diff --git a/docs/Data_Dictionary/images/CDE_Details.png b/docs/Data_Dictionary/images/CDE_Details.png
new file mode 100644
index 000000000..7cdf1a11a
Binary files /dev/null and b/docs/Data_Dictionary/images/CDE_Details.png differ
diff --git a/docs/Data_Dictionary/images/GDC_DD_Links.png b/docs/Data_Dictionary/images/GDC_DD_Links.png
new file mode 100644
index 000000000..042a79b80
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_DD_Links.png differ
diff --git a/docs/Data_Dictionary/images/GDC_DD_Properties_Boolean.png b/docs/Data_Dictionary/images/GDC_DD_Properties_Boolean.png
new file mode 100644
index 000000000..96d821475
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_DD_Properties_Boolean.png differ
diff --git a/docs/Data_Dictionary/images/GDC_DD_Properties_Enumeration.png b/docs/Data_Dictionary/images/GDC_DD_Properties_Enumeration.png
new file mode 100644
index 000000000..f222712cb
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_DD_Properties_Enumeration.png differ
diff --git a/docs/Data_Dictionary/images/GDC_DD_Properties_Integer.png b/docs/Data_Dictionary/images/GDC_DD_Properties_Integer.png
new file mode 100644
index 000000000..dd8305998
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_DD_Properties_Integer.png differ
diff --git a/docs/Data_Dictionary/images/GDC_DD_Properties_Number.png b/docs/Data_Dictionary/images/GDC_DD_Properties_Number.png
new file mode 100644
index 000000000..2b7980ffd
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_DD_Properties_Number.png differ
diff --git a/docs/Data_Dictionary/images/GDC_DD_Properties_String.png b/docs/Data_Dictionary/images/GDC_DD_Properties_String.png
new file mode 100644
index 000000000..614ee99b5
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_DD_Properties_String.png differ
diff --git a/docs/Data_Dictionary/images/GDC_DD_Title_and_Summary.png b/docs/Data_Dictionary/images/GDC_DD_Title_and_Summary.png
new file mode 100644
index 000000000..3f001b0b7
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_DD_Title_and_Summary.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search.png b/docs/Data_Dictionary/images/GDC_search.png
new file mode 100644
index 000000000..178f80808
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_exact_match_sample_type.png b/docs/Data_Dictionary/images/GDC_search_exact_match_sample_type.png
new file mode 100644
index 000000000..e5b14b97c
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_exact_match_sample_type.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_general_age.png b/docs/Data_Dictionary/images/GDC_search_general_age.png
new file mode 100644
index 000000000..8ae1d56f8
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_general_age.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_general_living.png b/docs/Data_Dictionary/images/GDC_search_general_living.png
new file mode 100644
index 000000000..f542e6d84
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_general_living.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_general_sample_type.png b/docs/Data_Dictionary/images/GDC_search_general_sample_type.png
new file mode 100644
index 000000000..63e0cac7f
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_general_sample_type.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_general_squamous_dictionary.png b/docs/Data_Dictionary/images/GDC_search_general_squamous_dictionary.png
new file mode 100644
index 000000000..318641170
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_general_squamous_dictionary.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_general_squamous_matched.png b/docs/Data_Dictionary/images/GDC_search_general_squamous_matched.png
new file mode 100644
index 000000000..c9147cc12
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_general_squamous_matched.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_general_squamous_see_all_values.png b/docs/Data_Dictionary/images/GDC_search_general_squamous_see_all_values.png
new file mode 100644
index 000000000..5db48b170
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_general_squamous_see_all_values.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_general_squamous_values.png b/docs/Data_Dictionary/images/GDC_search_general_squamous_values.png
new file mode 100644
index 000000000..d75bacf03
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_general_squamous_values.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_general_squamous_values_cadsr_values.png b/docs/Data_Dictionary/images/GDC_search_general_squamous_values_cadsr_values.png
new file mode 100644
index 000000000..b2a914234
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_general_squamous_values_cadsr_values.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_general_squamous_values_compare_list.png b/docs/Data_Dictionary/images/GDC_search_general_squamous_values_compare_list.png
new file mode 100644
index 000000000..8a53d4215
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_general_squamous_values_compare_list.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_general_squamous_values_terms.png b/docs/Data_Dictionary/images/GDC_search_general_squamous_values_terms.png
new file mode 100644
index 000000000..51644af3c
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_general_squamous_values_terms.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_property_description_age.png b/docs/Data_Dictionary/images/GDC_search_property_description_age.png
new file mode 100644
index 000000000..1b86df3ac
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_property_description_age.png differ
diff --git a/docs/Data_Dictionary/images/GDC_search_synonym_living.png b/docs/Data_Dictionary/images/GDC_search_synonym_living.png
new file mode 100644
index 000000000..bf7ea75c8
Binary files /dev/null and b/docs/Data_Dictionary/images/GDC_search_synonym_living.png differ
diff --git a/docs/Data_Dictionary/index.md b/docs/Data_Dictionary/index.md
index 38a60383d..60af70054 100644
--- a/docs/Data_Dictionary/index.md
+++ b/docs/Data_Dictionary/index.md
@@ -2,17 +2,150 @@
## Introduction
-The GDC Data Dictionary defines components of the [GDC Data Model](../Data/Data_Model/GDC_Data_Model.md) and relationships between them.
+The GDC Data Dictionary is a resource that describes the clinical, biospecimen, administrative, and genomic metadata that can be used in parallel with the genomic data generated by the GDC. The dictionary defines the structure of a database, the [data model](../Data/Data_Model/GDC_Data_Model.md), and the rules the data need to follow. In addition, the dictionary includes information about the relationships between entities within the data model.
+
+### Data Dictionary Components
+
+The GDC Data Dictionary consists of the following components:
+
+* Comprehensive list of nodes, which represent entities in the data model and help to group metadata into categories.
+* Comprehensive list of properties in the database and their schemas, which describe specific data elements that can be submitted to the GDC.
+* Comprehensive list of unique keys and links between properties.
+* Constraints and requirements defined on nodes and properties, including acceptable values and data types.
+
+### Standards and Conventions
+
+All properties and values in the GDC Data Dictionary include references to external standards defined and maintained by the [NCI Thesaurus](https://ncit.nci.nih.gov/ncitbrowser/) (NCIt) and the [Cancer Data Standards Registry and Repository](https://wiki.nci.nih.gov/display/caDSR/caDSR+Wiki) (caDSR). Both of these standards are operated by groups at [NCI's Center for Bioinformatics and Information Technology](https://cbiit.cancer.gov/) (CBIIT).
+
+Each property is assigned a [Common Data Element](https://cdebrowser.nci.nih.gov/cdebrowserClient/cdeBrowser.html#/search) (CDE) created by the caDSR. The CDE provides detailed information about the property including links to the NCIt through assigned concept codes. NCIt concepts are also assigned at the permissible value level for enumerated properties. The images below are an example of a caDSR CDE and its related property-level NCIt concepts.
+
+[![CDE Data Elements Details](images/CDE_Data_Element_Details.png)](images/CDE_Data_Element_Details.png "Click to see the full image.")
+[![CDE Details](images/CDE_Details.png)](images/CDE_Details.png "Click to see the full image.")
+
+In addition to the caDSR and NCIt references, many of the properties are defined by additional standards including, but not limited to, the following: [International Classification of Diseases](https://www.who.int/health-topics/international-classification-of-diseases) ([ICD-O-3](http://codes.iarc.fr/) and [ICD-10](https://www.cdc.gov/nchs/icd/icd10cm.htm)), [American Joint Committee on Cancer](https://cancerstaging.org/Pages/default.aspx) staging classifications, [Children's Oncology Group](https://www.childrensoncologygroup.org/) (COG) categorizations, and the [International Federation of Gynecology and Obstetrics](https://www.figo.org/) (FIGO) classifications. When one of these additional standards is used to describe a property, the standard is referenced in the property description and the list of allowable values reflects the criteria defined by the standard.
+
+Using external standards benefits both data contributors and data consumers at the GDC. For example, the curated lists of synonyms provided by NCIt allow for easy mapping of other study-specific clinical data standards to the GDC Data Dictionary. The available synonyms can be leveraged using the [GDC Data Dictionary Search](gdcmvs/).
## Data Dictionary Viewer
The [GDC Data Dictionary Viewer](viewer.md) is a user-friendly interface for accessing the dictionary. It includes the following functionality:
-* _Dictionary contents:_ Display of entities defined in the dictionary, including their descriptions, properties, and links.
-* _Links to semantic resources:_ Links to semantic data resources that define [Common Data Elements (CDEs)](http://cde.nih.gov) used in the dictionary
-* _Submission templates:_ Generation JSON and TSV templates for use in GDC data submission.
+* __Dictionary contents:__ Display of entities defined in the dictionary, including their descriptions, values or types, and links.
+* __Links to semantic resources:__ Links to semantic data resources that define [Common Data Elements (CDEs)](http://cde.nih.gov) used in the dictionary.
+* __Submission templates:__ JSON and TSV template generation for use in GDC data submission.
+
+### Components of the Data Dictionary Viewer
+
+The sections below provide an example of the information available for each specific node in the GDC Data Dictionary.
+
+#### Summary
+
+[![Title and Summary](images/GDC_DD_Title_and_Summary.png)](images/GDC_DD_Title_and_Summary.png "Click to see the full image.")
+
+* __Type:__ The name of the node.
+* __Category:__ The type of metadata; some examples are Clinical, Biospecimen, Analysis and Submittable Data Files.
+* __Description:__ This section contains a written explanation for the type of data that would be found in this node.
+* __Unique Keys:__ The properties or list of properties that can be used to identify this node, and only this node, within the commons.
+
+This section also contains a "Download Template" link with a drop-down menu containing the two template file types: TSV and JSON. These files will contain all properties that are found in the node, but not all [properties are required](#properties) to upload the node.
+
+#### Links
+
+[![Links](images/GDC_DD_Links.png)](images/GDC_DD_Links.png "Click to see the full image.")
+
+* __Links to Entity:__ Other nodes that can be connected to the focal node.
+* __Link Name:__ A simplified stand-in for the node link structure (requirement, target type, multiplicity, label). Its declaration categorizes the relationship between nodes.
+* __Relationship:__ The written description for the association between the focal node and the other connected node.
+* __Required:__ Displays whether the link to the node is required for the existence of the focal node. To link the focal node to a parent node, use the __.submitter_id__ with the value of that field set to the appropriate `submitter_id` in the parent node. For more information on creating links between nodes, please see the [Data Submission Walkthrough](Data_Submission_Portal/Users_Guide/Data_Submission_Walkthrough).
+
+#### Properties
+
+[![Properties Enumeration](images/GDC_DD_Properties_Enumeration.png)](images/GDC_DD_Properties_Enumeration.png "Click to see the full image.")
+[![Properties Integer](images/GDC_DD_Properties_Integer.png)](images/GDC_DD_Properties_Integer.png "Click to see the full image.")
+[![Properties Number](images/GDC_DD_Properties_Number.png)](images/GDC_DD_Properties_Number.png "Click to see the full image.")
+[![Properties String](images/GDC_DD_Properties_String.png)](images/GDC_DD_Properties_String.png "Click to see the full image.")
+[![Properties Boolean](images/GDC_DD_Properties_Boolean.png)](images/GDC_DD_Properties_Boolean.png "Click to see the full image.")
+
+* __Property:__ The name of the property.
+
+* __Description:__ The written explanation for the expected type and characterization of data found in this property.
+
+* __Acceptable Types or Values:__ The values that can be entered into the field based on the type category.
+ * Enumeration: A list of predetermined strings. An entry is valid only if it exactly matches one of the strings in the list, and matching is case-sensitive. Many of the properties with enumerations have numerous values. To see all of the values, click the "More Values" link at the bottom of the property row under the __Acceptable Types or Values__ column.
+ * Integer: A field that only accepts whole numbers.
+ * Number: A field that can accept any number including numbers with decimal places.
+ * String: A field in which alphanumeric characters and `_`, `.`, `-`, up to a length of 32,767, can be entered. Do not use other characters, as they will cause submission errors. Some string fields contain regex restrictions to coerce data to a specific pattern.
+ * Boolean: A field that only accepts `true` or `false` as acceptable values. If these values are not entered as lowercase, the dictionary will not recognize the value and an error will occur.
+
+* __Required:__ This informs the user whether this field is necessary for the submission of the node. If information for a required field is unknown or not reported, there is often a value to reflect that missing information.
+
+* __CDE:__ The caDSR CDE Public ID, with the direct link to its respective Data Element Details page.
+
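The type categories described above can be sketched as a small validator. This is only an illustration under stated assumptions: the function name `is_valid`, the category labels, and the sample values are hypothetical, and the actual GDC submission system may apply additional rules (for example, per-property regex restrictions) that are not modeled here.

```python
import math
import re

MAX_STRING_LENGTH = 32767
# Alphanumeric characters plus "_", "." and "-", as listed for String fields.
ALLOWED_STRING = re.compile(r"[A-Za-z0-9_.-]*")


def is_valid(value, kind, allowed=None):
    """Return True if the submitted text `value` fits the type category `kind`.

    `allowed` is the list of permitted strings and is used only for enumerations.
    """
    if kind == "enumeration":
        # Exact, case-sensitive match against the predetermined list.
        return value in (allowed or [])
    if kind == "integer":
        # Whole numbers only, with an optional sign.
        return re.fullmatch(r"[+-]?\d+", value) is not None
    if kind == "number":
        # Any finite number, including values with decimal places.
        try:
            return math.isfinite(float(value))
        except ValueError:
            return False
    if kind == "string":
        return (len(value) <= MAX_STRING_LENGTH
                and ALLOWED_STRING.fullmatch(value) is not None)
    if kind == "boolean":
        # Only lowercase "true" or "false" are recognized.
        return value in ("true", "false")
    raise ValueError(f"unknown type category: {kind}")
```

For instance, `is_valid("True", "boolean")` is rejected because of the capital letter, while `is_valid("true", "boolean")` passes.
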
+## Search Tool
+
+The Search Tool makes it easier for data submitters to query the GDC Data Dictionary and recommends GDC properties and values based on synonyms. Created by the NCI CBIIT EVS Team, it leverages the NCI vocabulary systems caDSR and NCIt. Below are some of the features included in the Search Tool:
+
+* Users can complete partial or exact match searches.
+* Searches can include terms that are synonymous to the GDC allowable values.
+* Users can compare their list of values to the GDC allowable values.
+* Dictionary paths are described so users can find the specific node where a property is located.
+
+### Components of the Search Tool
+
+The sections below provide an example of the information available for each portion of the Search Tool.
+
+#### Search Tool Modifiers
+
+The Search Tool is equipped with the following modifiers to customize searches in the GDC Data Dictionary:
+
+[![GDC search](images/GDC_search.png)](images/GDC_search.png "Click to see the full image.")
+
+* __Exact match:__ This will return matches for only the exact value entered into the search field.
+
+ [![GDC search general sample type](images/GDC_search_general_sample_type.png)](images/GDC_search_general_sample_type.png "Click to see the full image.")
+ [![GDC search exact match sample type](images/GDC_search_exact_match_sample_type.png)](images/GDC_search_exact_match_sample_type.png "Click to see the full image.")
+
+* __Property description:__ This will return matches found not only in the property, but also in the description of the property.
+
+ [![GDC search general age](images/GDC_search_general_age.png)](images/GDC_search_general_age.png "Click to see the full image.")
+ [![GDC search property description age](images/GDC_search_property_description_age.png)](images/GDC_search_property_description_age.png "Click to see the full image.")
+
+* __Synonyms:__ This will return not only matches for the value entered, but also other values that NCIt considers to be synonymous with the entered value.
+
+ [![GDC search general living](images/GDC_search_general_living.png)](images/GDC_search_general_living.png "Click to see the full image.")
+ [![GDC search synonym living](images/GDC_search_synonym_living.png)](images/GDC_search_synonym_living.png "Click to see the full image.")
+
+#### Result Fields
+
+The results from searches can be sorted into three different result fields:
+
+* __Values:__ This result section will return three columns that display matches to values that are found in the GDC Data Dictionary:
+ [![GDC search general squamous values](images/GDC_search_general_squamous_values.png)](images/GDC_search_general_squamous_values.png "Click to see the full image.")
+ * __Category / Node / Property:__ This section displays the GDC Data Dictionary hierarchy that precedes the search term. This section can also contain information such as:
+ * __See All Values:__ This window will display all GDC values for this property.
+ [![GDC search general squamous see all values](images/GDC_search_general_squamous_see_all_values.png)](images/GDC_search_general_squamous_see_all_values.png "Click to see the full image.")
+ * __Compare with User List:__ This window allows the user to input a list of values to check against the acceptable values for that property.
+ [![GDC search general squamous values compare list](images/GDC_search_general_squamous_values_compare_list.png)](images/GDC_search_general_squamous_values_compare_list.png "Click to see the full image.")
+ * __See All Terms:__ This window will display the NCIt code assigned to the specific term and the synonymous NCIt terms associated.
+ [![GDC search general squamous values terms](images/GDC_search_general_squamous_values_terms.png)](images/GDC_search_general_squamous_values_terms.png "Click to see the full image.")
+ * __caDSR: CDE, Values, Compare with GDC:__ This group of links sends the user to the CDE property page (CDE), opens a window that displays the caDSR values for that property (Values), or opens a window that compares the caDSR values with GDC values (Compare with GDC).
+ [![GDC search general squamous values cadsr values](images/GDC_search_general_squamous_values_cadsr_values.png)](images/GDC_search_general_squamous_values_cadsr_values.png "Click to see the full image.")
+ * __Matched GDC Values:__ This column will display all GDC values that match the term with ICD-O-3 and NCIt values if they are available.
+ [![GDC search general squamous matched](images/GDC_search_general_squamous_matched.png)](images/GDC_search_general_squamous_matched.png "Click to see the full image.")
+ * __CDE Permissible Values:__ This column displays GDC dictionary properties that have corresponding caDSR Common Data Elements (CDEs).
+* __Properties:__ This result section will return five columns that display matches to properties of the GDC Data Dictionary:
+ [![GDC search general age](images/GDC_search_general_age.png)](images/GDC_search_general_age.png "Click to see the full image.")
+ * __Category / Node:__ This column displays the Category and Node hierarchy for the search value.
+ * __Property:__ This column displays the name of the Property for the search value.
+ * __Description:__ This column displays the description for the returned property.
+ * __GDC Property Values:__ This column displays the value type for the returned property. For more information see the Acceptable Types or Values section under [Properties](#properties).
+ * __caDSR CDE Reference:__ This column displays the CDE link for the returned property.
+* __Dictionary:__ This result section will return two columns that display matches to values within the structure of the GDC Data Dictionary:
+ [![GDC search general squamous dictionary](images/GDC_search_general_squamous_dictionary.png)](images/GDC_search_general_squamous_dictionary.png "Click to see the full image.")
+ * __Name:__ This column displays the name of the Category, Node, or Property with a returned value total for each level.
+ * __Description:__ This column displays the GDC Data Dictionary description for each level.
-## Entity JSON Schemas
+## Data Dictionary API
In technical terms, the dictionary is a set of YAML files that define JSON schemas for each entity in the dictionary. The files are available [on GitHub](https://github.com/NCI-GDC/gdcdictionary/tree/develop/gdcdictionary/schemas).
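
As a rough illustration of how such a schema can be read programmatically, the sketch below summarizes which properties are required and what type category each one has. The schema fragment is hypothetical and heavily trimmed, and it is written as JSON rather than YAML only so that the example needs nothing beyond the Python standard library; the function name `summarize` is likewise an assumption, not part of any GDC tool.

```python
import json

# Hypothetical, trimmed schema fragment for illustration only.
# Real schemas contain many more keys and properties.
SCHEMA_JSON = """
{
  "id": "example_node",
  "category": "biospecimen",
  "required": ["submitter_id", "sample_type"],
  "properties": {
    "submitter_id": {"type": "string"},
    "sample_type": {"enum": ["Primary Tumor", "Blood Derived Normal"]},
    "is_ffpe": {"type": "boolean"}
  }
}
"""


def summarize(schema):
    """Map each property name to a (is_required, type_category) pair."""
    required = set(schema.get("required", []))
    summary = {}
    for name, spec in schema.get("properties", {}).items():
        # A property with an "enum" list is treated as an enumeration;
        # otherwise its declared "type" is reported.
        kind = "enumeration" if "enum" in spec else spec.get("type", "unknown")
        summary[name] = (name in required, kind)
    return summary


schema = json.loads(SCHEMA_JSON)
summary = summarize(schema)
```

With the actual YAML files, the same function could be applied after parsing them with a YAML library of choice.
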
diff --git a/docs/Data_Portal/Release_Notes/Data_Portal_Release_Notes.md b/docs/Data_Portal/Release_Notes/Data_Portal_Release_Notes.md
index 6b21a1cdd..daa45c5a5 100644
--- a/docs/Data_Portal/Release_Notes/Data_Portal_Release_Notes.md
+++ b/docs/Data_Portal/Release_Notes/Data_Portal_Release_Notes.md
@@ -2,6 +2,10 @@
| Version | Date |
|---|---|
+| [v1.20.0](Data_Portal_Release_Notes.md#release-1200) | April 17, 2019 |
+| [v1.19.0](Data_Portal_Release_Notes.md#release-1190) | February 20, 2019 |
+| [v1.18.0](Data_Portal_Release_Notes.md#release-1180) | December 18, 2018 |
+| [v1.17.0](Data_Portal_Release_Notes.md#release-1170) | November 7, 2018 |
| [v1.16.0](Data_Portal_Release_Notes.md#release-1160) | September 27, 2018 |
| [v1.15.0](Data_Portal_Release_Notes.md#release-1150) | August 23, 2018 |
| [v1.14.0](Data_Portal_Release_Notes.md#release-1140) | June 13, 2018 |
@@ -20,6 +24,142 @@
| [v1.0.1](Data_Portal_Release_Notes.md#release-101) | May 18, 2016 |
---
+## Release 1.20.0
+
+* __GDC Product__: GDC Data Portal
+* __Release Date__: April 17, 2019
+
+### New Features and Changes
+
+* Upgraded the Portal to use the latest React JavaScript library (version 16.8)
+
+### Bugs Fixed Since Last Release
+
+* None
+
+### Known Issues and Workarounds
+
+* Pre-release Data Portal login is not supported on Internet Explorer or the last version of Edge (42). Edge 41 does log in successfully.
+* Custom Facet Filters
+ * Some definitions are missing from the property list when adding custom facet file or case filters.
+* Visualizations
+ * SIFT and PolyPhen annotations are missing from the export JSON of the mutation table. They are present in the export TSV.
+ * Data Portal graphs cannot be exported as PNG images in Internet Explorer. Graphs can be exported in PNG or SVG format from Chrome or Firefox browsers. Internet Explorer does not display the chart legend and title when re-opening previously downloaded SVG files; the recommendation is to open downloaded SVG files with another program.
+* Repository and Cart
+ * The annotation count in File table of Repository and Cart does not link to the Annotations page anymore. The user can navigate to the annotations through the annotation count in Repository - Case table.
+* Legacy Archive
+ * Downloading a token in the GDC Legacy Archive does not refresh it. If a user downloads a token in the GDC Data Portal and then attempts to download a token in the GDC Legacy Archive, an old token may be provided. Reloading the Legacy Archive view will allow the user to download the updated token.
+ * Exporting the Cart table in JSON will export the GDC Archive file table instead of exporting the files in the Cart only.
+* Web Browsers
+ * Browsers limit the number of concurrent downloads. It is generally recommended to add files to the cart and download large numbers of files through the GDC Data Transfer Tool. More details can be found on the [GDC Website](https://gdc.cancer.gov/about-gdc/gdc-faqs).
+ * The GDC Portals are not compatible with Internet Explorer running in compatibility mode. Workaround is to disable compatibility mode.
+
+## Release 1.19.0
+
+* __GDC Product__: GDC Data Portal
+* __Release Date__: February 20, 2019
+
+### New Features and Changes
+
+* Added support for viewing of controlled-access mutations in the Data Portal
+* Added a new data access notification to remind logged-in users with access to controlled data that they need to follow their data use agreement. The message is fixed at the top of the Portal.
+* Added the ability to search for previous versions of files. If the user enters the UUID of a previous version that cannot be found, the Portal returns the UUID of the latest version available.
+* Renamed the Data Category for "Raw Sequencing Data" to "Sequencing Reads" throughout the portal where this appears, to be consistent with the Data Dictionary.
+* Added a link in the Portal footer to the GDC support page.
+
+### Bugs Fixed Since Last Release
+
+* Fixed bug where Survival Plot button never stops loading if plotting mutated vs. non-mutated cases for a single Gene.
+* Fixed inconsistent button styling when downloading controlled Downstream Analyses Files from File Entity page.
+* Removed unnecessary Survival column from Arrange Columns button on Case Entity, Gene Entity pages.
+* Removed unnecessary whitespace from pie charts on Repository page.
+* Added missing File Size unit to Clinical Supplement File, Biospecimen Supplement File tables on Case Entity page.
+* Fixed bug where clicking on Case Counts in Projects Graph tab was going to the Repository Files tab instead of the Cases tab.
+* Fixed bug where the counts shown beside custom filters on the Repository Cases tab were not updating when filtering on other facets.
+* Fixed bug where clicking the # of Affected Cases denominator on the Gene page's Most Frequent Somatic Mutations table displayed an incorrect number of Cases.
+
+### Known Issues and Workarounds
+
+* Pre-release Data Portal login is not supported on Internet Explorer or the last version of Edge (42). Edge 41 does log in successfully.
+* Custom Facet Filters
+ * Some definitions are missing from the property list when adding custom facet file or case filters.
+* Visualizations
+ * SIFT and PolyPhen annotations are missing from the export JSON of the mutation table. They are present in the export TSV.
+ * Data Portal graphs cannot be exported as PNG images in Internet Explorer. Graphs can be exported in PNG or SVG format from Chrome or Firefox browsers. Internet Explorer does not display the chart legend and title when re-opening previously downloaded SVG files; the recommendation is to open downloaded SVG files with another program.
+* Repository and Cart
+ * The annotation count in File table of Repository and Cart does not link to the Annotations page anymore. The user can navigate to the annotations through the annotation count in Repository - Case table.
+* Legacy Archive
+ * Downloading a token in the GDC Legacy Archive does not refresh it. If a user downloads a token in the GDC Data Portal and then attempts to download a token in the GDC Legacy Archive, an old token may be provided. Reloading the Legacy Archive view will allow the user to download the updated token.
+ * Exporting the Cart table in JSON will export the GDC Archive file table instead of exporting the files in the Cart only.
+* Web Browsers
+ * Browsers limit the number of concurrent downloads. It is generally recommended to add files to the cart and download large numbers of files through the GDC Data Transfer Tool. More details can be found on the [GDC Website](https://gdc.cancer.gov/about-gdc/gdc-faqs).
+ * The GDC Portals are not compatible with Internet Explorer running in compatibility mode. Workaround is to disable compatibility mode.
+
+## Release 1.18.0
+
+* __GDC Product__: GDC Data Portal
+* __Release Date__: December 18, 2018
+
+### New Features and Changes
+
+* A new data access message has been added when downloading controlled data. Users must agree to abide by data access control policies when downloading controlled data.
+* In the Mutation free-text search in Exploration, mutation display now includes the UUID, genomic location, and matched search term for easier mutation searching.
+* The ability to sort on ranked columns has been made available.
+
+### Bugs Fixed Since Last Release
+
+* In some cases, text was being cut off on the Project page visualization tab. Text is no longer cut off.
+* The HGNC link on the Gene page broke when the source URL format changed. The format was updated and the link is now functional.
+* In the biospecimen details on the Case page, the cart icon would disappear once clicked. It now is always visible.
+
+### Known Issues and Workarounds
+
+* Pre-release Data Portal login is not supported on Internet Explorer or the last version of Edge (42). Edge 41 does log in successfully.
+* Custom Facet Filters
+ * Some definitions are missing from the property list when adding custom facet file or case filters.
+* Visualizations
+ * SIFT and PolyPhen annotations are missing from the export JSON of the mutation table. They are present in the export TSV.
+ * Data Portal graphs cannot be exported as PNG images in Internet Explorer. Graphs can be exported in PNG or SVG format from Chrome or Firefox browsers. Internet Explorer does not display the chart legend and title when re-opening previously downloaded SVG files; the recommendation is to open downloaded SVG files with another program.
+* Repository and Cart
+ * The annotation count in File table of Repository and Cart does not link to the Annotations page anymore. The user can navigate to the annotations through the annotation count in Repository - Case table.
+* Legacy Archive
+ * Downloading a token in the GDC Legacy Archive does not refresh it. If a user downloads a token in the GDC Data Portal and then attempts to download a token in the GDC Legacy Archive, an old token may be provided. Reloading the Legacy Archive view will allow the user to download the updated token.
+ * Exporting the Cart table in JSON will export the GDC Archive file table instead of exporting the files in the Cart only.
+* Web Browsers
+ * Browsers limit the number of concurrent downloads. It is generally recommended to add files to the cart and download large numbers of files through the GDC Data Transfer Tool. More details can be found on the [GDC Website](https://gdc.cancer.gov/about-gdc/gdc-faqs).
+ * The GDC Portals are not compatible with Internet Explorer running in compatibility mode. Workaround is to disable compatibility mode.
+
+## Release 1.17.0
+
+* __GDC Product__: GDC Data Portal
+* __Release Date__: November 7, 2018
+
+### New Features and Changes
+
+* Copy Number Variation (CNV) data derived from GISTIC results are now available in the portal:
+ * View number of CNV events on a gene in a cohort in the Explore Gene table tab
+ * Explore CNVs associated with a gene on the Gene Entity Page
+ * Explore CNVs concurrently with mutations on the Oncogrid with new visualization
+
+### Bugs Fixed Since Last Release
+
+* None
+
+### Known Issues and Workarounds
+
+* Custom Facet Filters
+ * Some definitions are missing from the property list when adding custom facet file or case filters.
+* Visualizations
+ * SIFT and PolyPhen annotations are missing from the export JSON of the mutation table. They are present in the export TSV.
+ * Data Portal graphs cannot be exported as PNG images in Internet Explorer. Graphs can be exported in PNG or SVG format from Chrome or Firefox browsers. Internet Explorer does not display the chart legend and title when re-opening previously downloaded SVG files; the recommendation is to open downloaded SVG files with another program.
+* Repository and Cart
+ * The annotation count in the File table of Repository and Cart no longer links to the Annotations page. The user can navigate to the annotations through the annotation count in the Repository - Case table.
+* Legacy Archive
+ * Downloading a token in the GDC Legacy Archive does not refresh it. If a user downloads a token in the GDC Data Portal and then attempts to download a token in the GDC Legacy Archive, an old token may be provided. Reloading the Legacy Archive view will allow the user to download the updated token.
+ * Exporting the Cart table in JSON will export the GDC Archive file table instead of exporting the files in the Cart only.
+* Web Browsers
+ * Browsers limit the number of concurrent downloads. It is generally recommended to add files to the cart and download large numbers of files through the GDC Data Transfer Tool. More details can be found on the [GDC Website](https://gdc.cancer.gov/about-gdc/gdc-faqs).
+ * The GDC Portals are not compatible with Internet Explorer running in compatibility mode. The workaround is to disable compatibility mode.
## Release 1.16.0
diff --git a/docs/Data_Portal/Users_Guide/Advanced_Search.md b/docs/Data_Portal/Users_Guide/Advanced_Search.md
index 7a9c676f2..5f3b23844 100644
--- a/docs/Data_Portal/Users_Guide/Advanced_Search.md
+++ b/docs/Data_Portal/Users_Guide/Advanced_Search.md
@@ -7,7 +7,7 @@ Only available in the Repository view, the Advanced Search page offers complex q
## Overview: GQL
-Advanced search allows, via Genomic Query Language (GQL), to use structured queries to search for files and cases.
+Advanced search allows for structured queries to search for files and cases. This is done via Genomic Query Language (GQL), a query language created by the [GDC](https://gdc.cancer.gov/) and [OICR](https://oicr.on.ca/).
[![Advanced Search View](images/gdc-data-portal-advanced-search.png)](images/gdc-data-portal-advanced-search.png "Click to see the full image.")
@@ -17,7 +17,7 @@ A simple query in GQL (also known as a 'clause') consists of a __field__, follow
Note that it is not possible to compare two fields (e.g. disease_type = project.name).
-__Note__: GQL is not a database query language. For example, GQL does not have a "SELECT" statement.
+> __Note:__ GQL is not a database query language. For example, GQL does not have a "SELECT" statement.
### Switching between Advanced Search and Facet Filters
@@ -27,7 +27,7 @@ A query created in Advanced Search is not translated back to facet filters. Clic
## Using the Advanced Search
-When opening the advanced search page (via the Repository view), the search field will be automatically populated with facets filters already applied (if any).
+When opening the Advanced Search Page (via the Repository View), the search field will be automatically populated with any facet filters already applied.
This default query can be removed by pressing "Reset".
@@ -37,7 +37,7 @@ Once the query has been entered and is identified as a "Valid Query", click on "
As a query is being written, the GDC Data Portal will analyze the context and offer a list of auto-complete suggestions. Auto-complete suggests both fields and values as described below.
-#### Field Auto-complete
+### Field Auto-complete
The list of auto-complete suggestions includes __all__ available fields matching the user text input. The user has to scroll down to see more fields in the dropdown:
@@ -51,17 +51,17 @@ The value auto-complete is not aware of the general context of the query, the sy
[![Value Auto-complete](images/gdc-data-portal-advanced-search-value.png)](images/gdc-data-portal-advanced-search-value.png "Click to see the full image.")
-__Note__: Quotes are automatically added to the value if it contains spaces.
+> __Note:__ Quotes are automatically added to the value if it contains spaces.
## Setting Precedence of Operators
You can use parentheses in complex GQL statements to enforce the precedence of operators.
-For example, if you want to find all the open files in TCGA program as well as the files in TARGET program, you can use parentheses to enforce the precedence of the boolean operators in your query, i.e.:
+For example, if you want to find all the open files in the TCGA program as well as the files in the TARGET program, you can use parentheses to enforce the precedence of the Boolean operators in your query, as follows:
(files.access = open and cases.project.program.name = TCGA) or cases.project.program.name = TARGET
-__Note__: Without parentheses, the statement will be evaluated left-to-right.
+> __Note:__ Without parentheses, the statement will be evaluated left-to-right.
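To illustrate why grouping matters, the following Python sketch evaluates the same three clauses over a few made-up file records, grouped in two different ways. The records and helper names are hypothetical and only mimic the logic of the query above; this is not how the portal evaluates GQL.

```python
# Hypothetical file records; the keys loosely mirror the GQL fields used above.
files = [
    {"access": "open", "program": "TCGA"},
    {"access": "controlled", "program": "TCGA"},
    {"access": "open", "program": "TARGET"},
    {"access": "controlled", "program": "TARGET"},
]


def is_open(record):
    return record["access"] == "open"


def in_program(record, name):
    return record["program"] == name


# (files.access = open and program = TCGA) or program = TARGET
grouped_left = [
    f for f in files
    if (is_open(f) and in_program(f, "TCGA")) or in_program(f, "TARGET")
]

# files.access = open and (program = TCGA or program = TARGET)
grouped_right = [
    f for f in files
    if is_open(f) and (in_program(f, "TCGA") or in_program(f, "TARGET"))
]

print(len(grouped_left))   # 3: open TCGA files plus every TARGET file
print(len(grouped_right))  # 2: only open files from either program
```

The first grouping keeps controlled-access TARGET files, while the second does not, so the position of the parentheses changes which files are returned.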
## Keywords
@@ -69,42 +69,42 @@ A GQL keyword is a word that joins two or more clauses together to form a comple
**List of Keywords:**
-* AND
-* OR
+* __AND__
+* __OR__
-__Note__: parentheses can be used to control the order in which clauses are executed.
+> __Note:__ Parentheses can be used to control the order in which clauses are executed.
-### AND Keyword
+### "__AND__" Keyword
Used to combine multiple clauses, allowing you to refine your search.
Examples:
-* Find all open files in breast cancer
+* Find all open files in breast cancer:
- cases.project.primary_site = Breast and files.access = open
+ cases.primary_site = Breast and files.access = open
-* Find all open files in breast cancer and data type is copy number variation
+* Find all open files in breast cancer whose data type is gene expression quantification:
- cases.project.primary_site = Breast and files.access = open and files.data_type = "Copy number variation"
+ cases.primary_site = Breast and files.access = open and files.data_type = "Gene Expression Quantification"
-### OR Keyword
+### "__OR__" Keyword
Used to combine multiple clauses, allowing you to expand your search.
-__Note__: __IN__ keyword can be an alternative to OR and result in simplified queries.
+> __Note:__ The __IN__ keyword can be an alternative to __OR__ and result in simplified queries.
Examples:
-* Find all files that are raw sequencing data or raw microarray data:
+* Find all files that are raw sequencing data or aligned reads:
- files.data_type = "Raw microarray data" or files.data_type = "Raw sequencing data"
+ files.data_type = "Aligned Reads" or files.data_type = "Raw sequencing data"
-* Find all files where donors are male or vital status is alive:
+* Find all files where cases are male or vital status is alive:
- cases.demographic.gender = male or cases.diagnoses.vital_status = alive
+ cases.demographic.gender = male or cases.diagnoses.vital_status = alive
## Operators
@@ -127,136 +127,154 @@ An operator in GQL is one or more symbols or words comparing the value of a fiel
| NOT MISSING | Field NOT MISSING |
-### "=" operator - EQUAL
+### "__=__" Operator - __EQUAL__
-The "=" operator is used to search for files where the value of the specified field exactly matches the specified value.
+The "__=__" operator is used to search for files where the value of the specified field exactly matches the specified value.
Examples:
-* Find all files that are gene expression:
+* Find all files that are gene expression quantification:
- files.data_type = "Gene expression"
+ files.data_type = "Gene Expression Quantification"
* Find all cases whose gender is female:
- cases.demographic.gender = female
+ cases.demographic.gender = female
-### "!=" operator - NOT EQUAL
+### "__!=__" Operator - __NOT EQUAL__
-The "!=" operator is used to search for files where the value of the specified field does not match the specified value.
+The "__!=__" operator is used to search for files where the value of the specified field does not match the specified value.
-The "!=" operator will not match a field that has no value (i.e. a field that is empty). For example, 'gender != male' will only match cases who have a gender and the gender is not male. To find cases other than male or with no gender populated, you would need to type gender != male or gender is missing.
+The "__!=__" operator will not match a field that has no value (i.e. a field that is empty). For example:
+
+ cases.demographic.gender != male
+
+This search will only match cases who have a gender and the gender is not male. To find cases other than male or with no gender populated, you would need to search:
+
+ cases.demographic.gender != male or cases.demographic.gender is missing
Example:
-* Find all files with an experimental different from genotyping array:
+* Find all files with an experimental strategy that is not genotyping array:
- files.experimental_strategy != "Genotyping array"
+ files.experimental_strategy != "Genotyping array"
-### ">" operator - GREATER THAN
+### "__>__" Operator - __GREATER THAN__
-The ">" operator is used to search for files where the value of the specified field is greater than the specified value.
+The "__>__" operator is used to search for files where the value of the specified field is greater than the specified value.
Example:
* Find all cases whose number of days to death is greater than 60:
- cases.diagnoses.days_to_death > 60
+ cases.diagnoses.days_to_death > 60
-### ">=" operator - GREATER THAN OR EQUALS
+### "__>=__" Operator - __GREATER THAN OR EQUALS__
-The ">=" operator is used to search for files where the value of the specified field is greater than or equal to the specified value.
+The "__>=__" operator is used to search for files where the value of the specified field is greater than or equal to the specified value.
Example:
* Find all cases whose number of days to death is equal or greater than 60:
- cases.diagnoses.days_to_death >= 60
+ cases.diagnoses.days_to_death >= 60
-### "<" operator - LESS THAN
+### "__<__" Operator - __LESS THAN__
-The "<" operator is used to search for files where the value of the specified field is less than the specified value.
+The "__<__" operator is used to search for files where the value of the specified field is less than the specified value.
Example:
* Find all cases whose age at diagnosis is less than 400 days:
- cases.diagnoses.age_at_diagnosis < 400
+ cases.diagnoses.age_at_diagnosis < 400
-### "<=" operator - LESS THAN OR EQUALS
+### "__<=__" Operator - __LESS THAN OR EQUALS__
-The "<=" operator is used to search for files where the value of the specified field is less than or equal to the specified value.
+The "__<=__" operator is used to search for files where the value of the specified field is less than or equal to the specified value.
Example:
* Find all cases with a number of days to death less than or equal to 20:
- cases.diagnoses.days_to_death <= 20
+ cases.diagnoses.days_to_death <= 20
-### "IN" Operator
+### "__IN__" Operator
-The "IN" operator is used to search for files where the value of the specified field is one of multiple specified values. The values are specified as a comma-delimited list, surrounded by brackets [ ].
+The "__IN__" operator is used to search for files where the value of the specified field is one of multiple specified values. The values are specified as a comma-delimited list, surrounded by brackets [ ].
-Using "IN" is equivalent to using multiple 'EQUALS (=)' statements, but is shorter and more convenient. That is, typing 'project IN [ProjectA, ProjectB, ProjectC]' is the same as typing 'project = "ProjectA" OR project = "ProjectB" OR project = "ProjectC"'.
+Using "__IN__" is equivalent to using multiple "__=__" (__EQUALS__) statements, but is shorter and more convenient. That is, the following two statements return the same results:
+
+ cases.project.name IN [ProjectA, ProjectB, ProjectC]
+ cases.project.name = "ProjectA" OR cases.project.name = "ProjectB" OR cases.project.name = "ProjectC"
Examples:
-* Find all files in breast, breast and lung and cancer:
+* Find all files in breast, brain, and lung cancer:
+
+ cases.primary_site IN [Breast, Brain, Lung]
+
+* Find all files that are annotated somatic mutations or raw simple somatic mutations:
+
+ files.data_type IN ["Annotated Somatic Mutation", "Raw Simple Somatic Mutation"]
+
- cases.project.primary_site IN [Brain, Breast,Lung]
+### "__EXCLUDE__" Operator
-* Find all files tagged with exon or junction or hg19:
+The "__EXCLUDE__" operator is used to search for files where the value of the specified field is not one of multiple specified values.
- files.data_type IN ["Aligned reads", "Unaligned reads"]
+Using "__EXCLUDE__" is equivalent to using multiple "__!=__" (__NOT_EQUALS__) statements, but is shorter and more convenient. That is, the following two statements return the same results:
+ cases.project.name EXCLUDE [ProjectA, ProjectB, ProjectC]
+ cases.project.name != "ProjectA" AND cases.project.name != "ProjectB" AND cases.project.name != "ProjectC"
-### "EXCLUDE" Operator
+The "__EXCLUDE__" operator will not match a field that has no value (i.e. a field that is empty). For example:
-The "EXCLUDE" operator is used to search for files where the value of the specified field is not one of multiple specified values.
+ files.experimental_strategy EXCLUDE ["WGS","WXS"]
-Using "EXCLUDE" is equivalent to using multiple 'NOT_EQUALS (!=)' statements, but is shorter and more convenient. That is, typing 'project EXCLUDE [ProjectA, ProjectB, ProjectC]' is the same as typing 'project != "ProjectA" OR project != "ProjectB" OR project != "ProjectC"'
+This search will only match files that have an experimental strategy **and** the experimental strategy is not "WGS" or "WXS". To find files whose experimental strategy is neither "WGS" nor "WXS" **or is not assigned**, you would need to type:
-The "EXCLUDE" operator will not match a field that has no value (i.e. a field that is empty). For example, 'experimental strategy EXCLUDE ["WGS","WXS"]' will only match files that have an experimental strategy **and** the experimental strategy is not "WGS" or "WXS". To find files with an experimental strategy different from than "WGS" or "WXS" **or is not assigned**, you would need to type: files.experimental_strategy in ["WXS","WGS"] or files.experimental_strategy is missing.
+ files.experimental_strategy EXCLUDE ["WXS","WGS"] or files.experimental_strategy is missing
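The difference between excluding values and also accepting empty fields can be sketched in Python. The records below are made up, and the two functions only model the behavior described above; they are an assumption about the semantics, not the portal's implementation.

```python
# Hypothetical file records; None stands for a field that has no value.
files = [
    {"file": "a", "experimental_strategy": "WGS"},
    {"file": "b", "experimental_strategy": "WXS"},
    {"file": "c", "experimental_strategy": "RNA-Seq"},
    {"file": "d", "experimental_strategy": None},
]
excluded = {"WGS", "WXS"}


def matches_exclude(record):
    """Model of EXCLUDE: the field must have a value outside the list."""
    value = record["experimental_strategy"]
    return value is not None and value not in excluded


def matches_exclude_or_missing(record):
    """Model of EXCLUDE combined with 'is missing'."""
    value = record["experimental_strategy"]
    return value is None or value not in excluded


only_exclude = [r["file"] for r in files if matches_exclude(r)]
with_missing = [r["file"] for r in files if matches_exclude_or_missing(r)]

print(only_exclude)  # ['c']
print(with_missing)  # ['c', 'd']
```

File `d` has no experimental strategy, so it only appears once the missing-value clause is added.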
Examples:
* Find all files where experimental strategy is not WXS, WGS, Genotyping array:
- files.experimental_strategy EXCLUDE [WXS, WGS, "Genotyping array"]
+ files.experimental_strategy EXCLUDE [WXS, WGS, "Genotyping array"]
-### "IS MISSING" Operator
+### "__IS MISSING__" Operator
-The "IS" operator can only be used with "MISSING". That is, it is used to search for files where the specified field has no value.
+The "__IS__" operator can only be used with "__MISSING__". That is, it is used to search for files where the specified field has no value.
Examples:
* Find all cases where gender is missing:
- cases.demographic.gender is MISSING
+ cases.demographic.gender is MISSING
-### "NOT MISSING" Operator
+### "__NOT MISSING__" Operator
-The "NOT" operator can only be used with "MISSING". That is, it is used to search for files where the specified field has a value.
+The "__NOT__" operator can only be used with "__MISSING__". That is, it is used to search for files where the specified field has a value.
Examples:
* Find all cases where race is not missing:
- cases.demographic.race NOT MISSING
+ cases.demographic.race NOT MISSING
## Special Cases
-### Date format
+### Date Format
The date format should be the following: **YYYY-MM-DD** (without quotes).
Example:
- files.updated_datetime > 2015-12-31
+ files.updated_datetime > 2015-12-31
### Using Quotes
@@ -265,9 +283,9 @@ A value must be quoted if it contains a space. Otherwise the advanced search wil
Quotes are not necessary if the value consists of one single word.
-* Example: Find all cases with primary site is brain and data type is copy number variation:
+* Example: Find all cases with primary site is brain and data type is copy number segment:
- cases.project.primary_site = Brain and files.data_type = "Copy number variation"
+ cases.primary_site = Brain and files.data_type = "Copy Number Segment"
### Age at Diagnosis - Unit in Days
@@ -277,7 +295,7 @@ The __conversion factor__ is 1 year = 365.25 days
* Example: Find all cases whose age at diagnosis > 40 years old (40 * 365.25)
- cases.diagnoses.age_at_diagnosis > 14610
+ cases.diagnoses.age_at_diagnosis > 14610
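The conversion above can be checked with a short Python sketch. The helper names are hypothetical; only the conversion factor of 365.25 days per year and the field name come from this guide.

```python
DAYS_PER_YEAR = 365.25  # conversion factor stated in this section


def years_to_days(years):
    """Convert an age in years to a whole number of days."""
    return round(years * DAYS_PER_YEAR)


def age_query(years, operator=">"):
    """Hypothetical helper that formats a GQL clause as a string."""
    return f"cases.diagnoses.age_at_diagnosis {operator} {years_to_days(years)}"


print(years_to_days(40))  # 14610
print(age_query(40))      # cases.diagnoses.age_at_diagnosis > 14610
```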
@@ -285,119 +303,4 @@ The __conversion factor__ is 1 year = 365.25 days
The full list of fields available on the GDC Data Portal can be found through the GDC API using the following endpoint:
-[https://api.gdc.cancer.gov/gql/_mapping](https://api.gdc.cancer.gov/gql/_mapping)
-
-Alternatively, a static list of fields is available below (not exhaustive).
-
-### Files
-
-+ files.access
-+ files.acl
-+ files.archive.archive_id
-+ files.archive.revision
-+ files.archive.submitter_id
-+ files.center.center_id
-+ files.center.center_type
-+ files.center.code
-+ files.center.name
-+ files.center.namespace
-+ files.center.short_name
-+ files.data_format
-+ files.data_subtype
-+ files.data_type
-+ files.experimental_strategy
-+ files.file_id
-+ files.file_name
-+ files.file_size
-+ files.md5sum
-+ files.origin
-+ files.platform
-+ files.related_files.file_id
-+ files.related_files.file_name
-+ files.related_files.md5sum
-+ files.related_files.type
-+ files.state
-+ files.state_comment
-+ files.submitter_id
-+ files.tags
-
-### Cases
-
-+ cases.case_id
-+ cases.submitter_id
-+ cases.diagnoses.age_at_diagnosis
-+ cases.diagnoses.days_to_death
-+ cases.demographic.ethnicity
-+ cases.demographic.gender
-+ cases.demographic.race
-+ cases.diagnoses.vital_status
-+ cases.project.disease_type
-+ cases.project.name
-+ cases.project.program.name
-+ cases.project.program.program_id
-+ cases.project.project_id
-+ cases.project.state
-+ cases.samples.sample_id
-+ cases.samples.submitter_id
-+ cases.samples.sample_type
-+ cases.samples.sample_type_id
-+ cases.samples.shortest_dimension
-+ cases.samples.time_between_clamping_and_freezing
-+ cases.samples.time_between_excision_and_freezing
-+ cases.samples.tumor_code
-+ cases.samples.tumor_code_id
-+ cases.samples.current_weight
-+ cases.samples.days_to_collection
-+ cases.samples.days_to_sample_procurement
-+ cases.samples.freezing_method
-+ cases.samples.initial_weight
-+ cases.samples.intermediate_dimension
-+ cases.samples.is_ffpe
-+ cases.samples.longest_dimension
-+ cases.samples.oct_embedded
-+ cases.samples.pathology_report_uuid
-+ cases.samples.portions.analytes.a260_a280_ratio
-+ cases.samples.portions.analytes.aliquots.aliquot_id
-+ cases.samples.portions.analytes.aliquots.amount
-+ cases.samples.portions.analytes.aliquots.center.center_id
-+ cases.samples.portions.analytes.aliquots.center.center_type
-+ cases.samples.portions.analytes.aliquots.center.code
-+ cases.samples.portions.analytes.aliquots.center.name
-+ cases.samples.portions.analytes.aliquots.center.namespace
-+ cases.samples.portions.analytes.aliquots.center.short_name
-+ cases.samples.portions.analytes.aliquots.concentration
-+ cases.samples.portions.analytes.aliquots.source_center
-+ cases.samples.portions.analytes.aliquots.submitter_id
-+ cases.samples.portions.analytes.amount
-+ cases.samples.portions.analytes.analyte_id
-+ cases.samples.portions.analytes.analyte_type
-+ cases.samples.portions.analytes.concentration
-+ cases.samples.portions.analytes.spectrophotometer_method
-+ cases.samples.portions.analytes.submitter_id
-+ cases.samples.portions.analytes.well_number
-+ cases.samples.portions.center.center_id
-+ cases.samples.portions.center.center_type
-+ cases.samples.portions.center.code
-+ cases.samples.portions.center.name
-+ cases.samples.portions.center.namespace
-+ cases.samples.portions.center.short_name
-+ cases.samples.portions.is_ffpe
-+ cases.samples.portions.portion_id
-+ cases.samples.portions.portion_number
-+ cases.samples.portions.slides.number_proliferating_cells
-+ cases.samples.portions.slides.percent_eosinophil_infiltration
-+ cases.samples.portions.slides.percent_granulocyte_infiltration
-+ cases.samples.portions.slides.percent_inflam_infiltration
-+ cases.samples.portions.slides.percent_lymphocyte_infiltration
-+ cases.samples.portions.slides.percent_monocyte_infiltration
-+ cases.samples.portions.slides.percent_necrosis
-+ cases.samples.portions.slides.percent_neutrophil_infiltration
-+ cases.samples.portions.slides.percent_normal_cells
-+ cases.samples.portions.slides.percent_stromal_cells
-+ cases.samples.portions.slides.percent_tumor_cells
-+ cases.samples.portions.slides.percent_tumor_nuclei
-+ cases.samples.portions.slides.section_location
-+ cases.samples.portions.slides.slide_id
-+ cases.samples.portions.slides.submitter_id
-+ cases.samples.portions.submitter_id
-+ cases.samples.portions.weight
+[https://api.gdc.cancer.gov/gql/_mapping](https://api.gdc.cancer.gov/gql/_mapping)
\ No newline at end of file
diff --git a/docs/Data_Portal/Users_Guide/Cart.md b/docs/Data_Portal/Users_Guide/Cart.md
index f24070cb7..c7b1c840d 100644
--- a/docs/Data_Portal/Users_Guide/Cart.md
+++ b/docs/Data_Portal/Users_Guide/Cart.md
@@ -1,62 +1,62 @@
# Cart and File Download
-## Overview
-
-While browsing the GDC Data Portal, files can either be downloaded individually from [file detail pages](Repository.md#file-summary-page) or collected in the file cart to be downloaded as a bundle. Clicking on the shopping cart icon that is next to any item in the GDC will add the item to your cart.
+While browsing the GDC Data Portal, files can either be downloaded individually from [File Summary Pages](Repository.md#file-summary-page) or collected in the file cart to be downloaded as a bundle. Clicking on the shopping cart icon that is next to any item in the GDC will add the item to your cart.
## GDC Cart
[![Cart](images/cart-overview_v2.png)](images/cart-overview_v2.png "Click to see the full image.")
-### Cart Summary
+## Cart Summary
-The cart page shows a summary of all files currently in the cart:
+The Cart Summary Page shows a summary of all files currently in the cart:
-* Number of files
-* Number of cases associated with the files
-* Total file size
+* Number of files.
+* Number of cases associated with the files.
+* Total file size.
The Cart page also displays two tables:
-* __File count by project__: Breaks down the files and cases by each project
-* __File count by authorization level__: Breaks down the files in the cart by authorization level. A user must be logged into the GDC in order to download 'Controlled-Access files'
+* __File count by project__: Breaks down the files and cases by each project.
+* __File count by authorization level__: Breaks down the files in the cart by authorization level. A user must be logged into the GDC in order to download 'Controlled-Access files'.
-The cart also directs users how to download files in the cart. For large data files, it is recommended that the GDC Data Transfer Tool be used.
+The cart also directs users how to download files in the cart. For large data files, it is recommended that the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool) be used.
-### Cart Items
+## Cart Items
[![Cart](images/gdc-cart-items_v2.png)](images/gdc-cart-items_v2.png "Click to see the full image.")
The Cart Items table shows the list of all the files that were added to the Cart. The table gives the folowing information for each file in the cart:
* __Access__: Displays whether the file is open or controlled access. Users must login to the GDC Portal and have the appropriate credentials to access these files.
-* __File Name__: Name of the file. Clicking the link will bring the user to the file summary page.
-* __Cases__: How many cases does the file contain. Clicking the link will bring the user to the case summary page.
-* __Project__: The Project that the file belongs to. Clicking the link will bring the user to the Project summary page.
-* __Category__: Type of data
-* __Format__: The file format
-* __Size__: The size of the file
-* __Annotations__: Whether there are any annotations
+* __File Name__: Name of the file. Clicking the link will bring the user to the [File Summary Page](Repository.md#file-summary-page).
+* __Cases__: The number of cases the file contains. Clicking the link will bring the user to the [Case Summary Page](Exploration.md#case-summary-page).
+* __Project__: The Project that the file belongs to. Clicking the link will bring the user to the [Project Summary Page](Projects.md#project-summary-page).
+* __Category__: Type of data.
+* __Format__: The file format.
+* __Size__: The size of the file.
+* __Annotations__: Whether there are any annotations.
-## Download Options
+# Download Options
[![Cart](images/gdc-download-options_v2.png)](images/gdc-download-options_v2.png "Click to see the full image.")
-There are a few buttons on the Cart page that allow users to download files. The following download options are available:
+The buttons on the Cart page allow users to download the files in the cart along with data related to them. The following download options are available:
-* __Biospecimen__: Downloads bioscpecimen data related to files in the cart in either TSV or JSON format.
-* __Clinical__: Downloads clinical data related to files in the cart in either TSV or JSON format.
-* __Sample Sheet__: Downloads a tab-separated file which contains the associated case/sample IDs and sample type for each file in the cart.
-* __Metadata__: GDC harmonized clinical, biospecimen, and file metadata associated with the files in the cart.
-* __Download Manifest__: Download a manifest file for use with the GDC Data Transfer Tool to download files. A manifest file contains a list of the UUIDs that correspond to the files in the cart.
-* __Download Cart__: Download the files in the Cart directly through the browser. Users have to be cautious of the amount of data in the cart since this option will not optimize bandwidth and will not provide resume capabilities.
-* __SRA XML, MAGE-TAB__: This option is available in the GDC Legacy Archive only. It is used to download metadata files associated with the files in the cart.
+* __Biospecimen:__ Downloads biospecimen data related to files in the cart in either TSV or JSON format.
+* __Clinical:__ Downloads clinical data related to files in the cart in either TSV or JSON format.
+* __Sample Sheet:__ Downloads a tab-separated file which contains the associated case/sample IDs and the sample type (Tumor/Normal) for each file in the cart.
+* __Metadata:__ GDC harmonized clinical, biospecimen, and file metadata associated with the files in the cart.
+* __Download:__
+ * __Manifest:__ Download a manifest file for use with the GDC Data Transfer Tool to download files. A manifest file contains a list of the UUIDs that correspond to the files in the cart.
+ * __Cart:__ Download the files in the Cart directly through the browser. Users have to be cautious of the amount of data in the cart since this option will not optimize bandwidth and will not provide resume capabilities.
+* __Remove from Cart:__ Remove all files or unauthorized files from the cart.
+* __SRA XML, MAGE-TAB:__ This option is available in the GDC Legacy Archive only. It is used to download metadata files associated with the files in the cart.
-The cart allows users to download up to 5 GB of data directly through the web browser. This is not recommended for downloading large volumes of data, in particular due to the absence of a retry/resume mechanism. For downloads over 5 GB we recommend using the GDC Data Transfer Tool.
+The cart allows users to download up to 5 GB of data directly through the web browser. This is not recommended for downloading large volumes of data, in particular due to the absence of a retry/resume mechanism. For downloads over 5 GB we recommend using the `Download Manifest` button to download a manifest file that can be imported into the [GDC Data Transfer Tool](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Getting_Started/).
-__Note__: when downloading multiple files from the cart, they are automatically bundled into one single Gzipped (.tar.gz) file.
+> __Note:__ When downloading multiple files from the cart, they are automatically bundled into a single gzipped (.tar.gz) file.
-### GDC Data Transfer Tool
+## [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool)
The `Download Manifest` button will download a manifest file that can be imported into the GDC Data Transfer Tool. Below is an example of the contents of a manifest file used for download:
@@ -73,14 +73,69 @@ c57673ac-998a-4a50-a12b-4cac5dc3b72e mdanderson.org_KIRP.MDA_RPPA_Core.mage-tab.
The Manifest contains a list of the file UUIDs in the cart and can be used together with the GDC Data Transfer Tool to download all files.
-Information on the GDC Data Transfer Tool is available in the [GDC Data Transfer Tool User's Guide](/node/8196/).
+Information on the GDC Data Transfer Tool is available in the [GDC Data Transfer Tool User's Guide](../../Data_Transfer_Tool/Users_Guide/Getting_Started.md).
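As a rough illustration of how a manifest can be inspected before a download, the Python sketch below reads a tab-separated manifest and collects the UUIDs and total size. The column names (`id`, `filename`, `md5`, `size`, `state`) and all row values are assumptions used for illustration; check them against a manifest downloaded from the cart.

```python
import csv
import io

# Hypothetical manifest content; UUIDs, file names, and checksums are placeholders.
manifest_text = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\texample_a.txt\t"
    "d41d8cd98f00b204e9800998ecf8427e\t1024\tvalidated\n"
    "00000000-0000-0000-0000-000000000002\texample_b.txt\t"
    "d41d8cd98f00b204e9800998ecf8427e\t2048\tvalidated\n"
)


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_manifest(manifest_text)
uuids = [row["id"] for row in rows]
total_bytes = sum(int(row["size"]) for row in rows)

print(uuids)
print(total_bytes)  # 3072
```

Summing the `size` column gives a quick estimate of whether a browser download is practical or the GDC Data Transfer Tool is the better choice.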
+
+# Controlled Files
+
+If a user tries to download a cart containing controlled files without being authenticated, a pop-up will offer the user the choice of downloading only open-access files or logging in to the GDC Data Portal through eRA Commons. See [Authentication](#authentication) for details.
+
+Once a user is logged in, controlled files that they have access to can be downloaded. To download files from the portal, users must agree to the GDC and individual project Data Use Agreements by selecting the agreement checkbox on the Access Alert message.
+
+[![Cart Page](images/gdc-data-portal-download-cart_v2.png)](images/gdc-data-portal-download-cart_v2.png "Click to see the full image.")
+
+# Authentication
+
+The GDC Data Portal provides granular metadata for all datasets available in the GDC. Any user can see a listing of all available data files, including controlled-access files. The GDC Data Portal also allows users to download open-access files without logging in. However, downloading of controlled-access files is restricted to authorized users and requires authentication.
+
+## Logging into the GDC
+
+To log in to the GDC, users must click on the `Login` button on the top right of the GDC Website.
+
+![Login](images/gdc-login.png)
+
+After clicking Login, users authenticate themselves using their eRA Commons login and password. If authentication is successful, the eRA Commons username will be displayed in the upper right corner of the screen, in place of the "Login" button.
+
+Upon successful authentication, GDC Data Portal users can:
+
+- See which controlled-access files they can access.
+- Download controlled-access files directly from the GDC Data Portal.
+- Download an authentication token for use with the GDC Data Transfer Tool or the GDC API.
+- See controlled-access mutation data they can access.
+
+Controlled-access files are identified using a "lock" icon:
+
+[![GDC Data Portal Main Page](images/gdc-data-portal-controlled-files.png)](images/gdc-data-portal-controlled-files.png "Click to see the full image.")
+
+The rest of this section describes controlled data access features of the GDC Data Portal available to authorized users. For more information about open and controlled-access data, and about obtaining access to controlled data, see [Data Access Processes and Tools](https://gdc.cancer.gov/access-data/data-access-processes-and-tools).
+
+## User Profile
+
+After logging into the GDC Portal, users can view which projects they have access to by clicking the `User Profile` section in the dropdown menu in the top corner of the screen.
+
+[![User Profile Drop Down](images/gdc-user-profile-dropdown.png)](images/gdc-user-profile-dropdown.png "Click to see the full image.")
+
+Clicking this button shows the list of projects.
+
+[![User Profile](images/gdc-user-profile.png)](images/gdc-user-profile.png "Click to see the full image.")
+
+## GDC Authentication Tokens
+
+The GDC Data Portal provides authentication tokens for use with the GDC Data Transfer Tool or the GDC API. To download a token:
+
+1. Log into the GDC using your eRA Commons credentials.
+2. Click the username in the top right corner of the screen.
+3. Select the "Download token" option.
+
+![Token Download Button](images/gdc-data-portal-token-download.png)
+
+A new token is generated each time the `Download Token` button is clicked.
-### Individual Files Download
+For more information about authentication tokens, see [Data Security](../../Data/Data_Security/Data_Security.md#authentication-tokens).
-Similar to the files page, each row contains a download button to download a particular file individually.
+>__Note:__ The authentication token should be kept in a secure location, as it allows access to all data accessible by the associated user account.
-## Controlled Files
+## Logging Out
-If a user tries to download a cart containing controlled files and without being authenticated, a pop-up will be displayed to offer the user either to download only open access files or to login into the GDC Data Portal through eRA Commons. See [Authentication](Authentication.md) for details.
+To log out of the GDC, click the username in the top right corner of the screen, and select the Logout option.
-[![Cart Page](images/gdc-data-portal-download-cart.png)](images/gdc-data-portal-download-cart.png "Click to see the full image.")
+![Logout link](images/gdc-data-portal-token-download.png)
diff --git a/docs/Data_Portal/Users_Guide/Custom_Set_Analysis.md b/docs/Data_Portal/Users_Guide/Custom_Set_Analysis.md
index 6cfdc8d18..40231a40a 100644
--- a/docs/Data_Portal/Users_Guide/Custom_Set_Analysis.md
+++ b/docs/Data_Portal/Users_Guide/Custom_Set_Analysis.md
@@ -1,6 +1,6 @@
-# Custom Set Analysis
+# Analysis
-In addition to the [Exploration page](Exploration.md), the GDC Data Portal also has features used to save and compare sets of cases, genes, and mutations. These sets can either be generated with existing filters (e.g. males with lung cancer) or through custom selection (e.g. a user-generated list of case IDs).
+In addition to the [Exploration Page](Exploration.md), the GDC Data Portal also has features used to save and compare sets of cases, genes, and mutations. These sets can either be generated with existing filters (e.g. males with lung cancer) or through custom selection (e.g. a user-generated list of case IDs).
Note that saving a set only saves the type of entity included in the set. For example, a saved case set will not include filters that were applied to genes or mutations. Please be aware that your custom sets are deleted during each new GDC data release. You can export them and re-upload them in the "Manage Sets" link at the top right of the Portal.
@@ -8,17 +8,53 @@ Note that saving a set only saves the type of entity included in the set. For ex
Cohort sets are completely customizable and can be generated for cases, genes, or mutations using the following methods:
-__Upload ID Set:__ This feature is available in the "Manage Sets" link at the top right of the Portal. Choose "Upload Set" and then select whether the set comprises cases, genes, or mutations. A set of IDs (IDs* or UUIDs) can then be uploaded in a text file or copied and pasted into the list of identifiers field along with a name identifying the set. Once the list of identifiers is uploaded, they are validated and grouped according to whether the identifier matched an existing GDC ID or did not match ("Unmatched").
+__Apply Filters in Exploration:__ Sets can be assembled using the existing filters in the Exploration Page. Case sets can be saved by choosing the "Save/Edit Case Set" button under the pie charts. This will prompt a decision to save the selection as a new case set. Gene and mutation filters can likewise be applied and saved in the Genes and Mutations tabs, respectively.
+
+[![Exploration Set](images/GDC-ExplorationSet-Cohort_v2.png)](images/GDC-ExplorationSet-Cohort_v2.png "Click to see the full image.")
+
+__Upload ID Set:__ This feature is available in the "Manage Sets" link at the top right of the Portal. Choose "Upload Set" and then select whether the set comprises cases, genes, or mutations. A set of IDs or UUIDs can then be uploaded in a text file or copied and pasted into the list of identifiers field along with a name identifying the set. Once the list of identifiers is uploaded, the IDs are validated and grouped according to whether or not the identifier matched an existing GDC ID.
[![Upload Set](images/GDC-UploadSet-Cohort_v2.png)](images/GDC-UploadSet-Cohort_v2.png "Click to see the full image.")
-\* This is referred to as a `submitter_id` in the GDC API, which is a non-UUID identifier such as a TCGA barcode.
+### Upload Case Set
-__Apply Filters in Exploration:__ Sets can be assembled using the existing filters in the Exploration page. They can be saved by choosing the "Save/Edit Case Set" button under the pie charts for case sets. This will prompt a decision to save as new case set.
+In the `Cases` filters panel, instead of supplying cases one-by-one, users can supply a list of cases. Clicking on the `Upload Case Set` button will launch a dialog as shown below, where users can supply a list of cases or upload a comma-separated text file of cases.
-Similarly, gene and mutation filters can be applied and saved in the Exploration page in the Genes and Mutations tab, respectively.
+[![Upload Case Set](images/gdc-exploration-case-set.png)](images/gdc-exploration-case-set.png "Click to see the full image.")
-[![Exploration Set](images/GDC-ExplorationSet-Cohort_v2.png)](images/GDC-ExplorationSet-Cohort_v2.png "Click to see the full image.")
+After a list of cases is supplied, a table appears below that indicates whether each case was found.
+
+[![Upload Case Set Validation](images/gdc-exploration-case-set-validation.png)](images/gdc-exploration-case-set-validation.png "Click to see the full image.")
+
+Clicking on `Submit` will filter the results in the Exploration Page by those cases.
+
+[![Upload Case Set Results](images/case-set-filter_v3.png)](images/case-set-filter_v3.png "Click to see the full image.")
+
+### Upload Gene Set
+
+In the `Genes` filters panel, instead of supplying genes one-by-one, users can supply a list of genes. Clicking on the `Upload Gene Set` button will launch a dialog as shown below, where users can supply a list of genes or upload a comma-separated text file of genes.
+
+[![Upload Gene Set](images/Exploration-Upload-Gene-Set.png)](images/Exploration-Upload-Gene-Set.png "Click to see the full image.")
+
+After a list of genes is supplied, a table appears below that indicates whether each gene was found.
+
+[![Upload Gene Set Validation](images/Exploration-Upload-Gene-Set-Validation.png)](images/Exploration-Upload-Gene-Set-Validation.png "Click to see the full image.")
+
+Clicking on `Submit` will filter the results in the Exploration Page by those genes.
+
+### Upload Mutation Set
+
+In the `Mutations` filters panel, instead of supplying mutation IDs one-by-one, users can supply a list of mutations. Clicking on the `Upload Mutation Set` button will launch a dialog as shown below, where users can supply a list of mutations or upload a comma-separated text file of mutations.
+
+[![Upload Mutation Set](images/gdc-exploration-mutation-set.png)](images/gdc-exploration-mutation-set.png "Click to see the full image.")
+
+After a list of mutations is supplied, a table appears below that indicates whether each mutation was found.
+
+[![Upload Mutation Set Validation](images/gdc-exploration-mutation-set-validation.png)](images/gdc-exploration-mutation-set-validation.png "Click to see the full image.")
+
+Clicking on `Submit` will filter the results in the Exploration Page by those mutations.
+
+[![Upload Mutation Set Results](images/mutation-set-filter.png)](images/mutation-set-filter.png "Click to see the full image.")
## Analysis Page
Clicking on the `Analysis` button in the top toolbar will launch the Analysis Page which displays the various options available for comparing saved sets.
@@ -27,8 +63,8 @@ Clicking on the `Analysis` button in the top toolbar will launch the Analysis Pa
There are two tabs on this page:
-* __Launch Analysis__: Where users can select either to do `Set Operations` or `Cohort Comparison`
-* __Results__: Where users can view the results of current or previous set analyses
+* __Launch Analysis__: Where users can choose to run either `Set Operations` or `Cohort Comparison`.
+* __Results__: Where users can view the results of current or previous set analyses.
## Analysis Page: Set Operations
@@ -38,20 +74,19 @@ Up to three sets of the same set type can be compared and exported based on comp
* __Venn Diagram:__ Visually displays the overlapping items included within the three sets. Subsets based on overlap can be selected by clicking one or many sections of the Venn diagram. As sections of the Venn Diagram become highlighted in blue, their corresponding row in the overlap table becomes highlighted.
-* __Summary Table:__ Displays the alias, item type, and name for each set included in this analysis
-
-* __Overlap Table:__ Displays the number of overlapping items with set operations rather than a visual diagram. Subsets can be selected by checking boxes in the "Select" column, which will highlight the corresponding section of the Venn Diagram. As rows are selected, the "Union of selected sets" row is populated. Each row has an option to save the subset as a new set, export the set as a TSV, or view files in the repository. The links that correspond to the number of items in each row will open the cohort in the Exploration page.
+* __Summary Table:__ Displays the alias, item type, and name for each set included in this analysis.
+* __Overlap Table:__ Displays the number of overlapping items with set operations rather than a visual diagram. Subsets can be selected by checking boxes in the "Select" column, which will highlight the corresponding section of the Venn Diagram. As rows are selected, the "Union of selected sets" row is populated. Each row has an option to save the subset as a new set, export the set as a TSV, or view files in the repository. The links that correspond to the number of items in each row will open the cohort in the Exploration Page.
## Analysis Tab: Cohort Comparison
The "Cohort Comparison" analysis displays a series of graphs and tables that demonstrate the similarities and differences between two case sets. The following features are displayed for each two sets:
-* A key detailing the number of cases in each cohort and the color that represents each (blue/gold)
+* A key detailing the number of cases in each cohort and the color that represents each (blue/gold).
-* A Venn diagram, which shows the overlap between the two cohorts. The Venn diagram can be opened in a 'Set Operations' tab by choosing "Open venn diagram in new tab"
+* A Venn diagram, which shows the overlap between the two cohorts. The Venn diagram can be opened in a 'Set Operations' tab by choosing "Open Venn diagram in new tab".
-* A selectable [survival plot](Projects/#survival-analysis) that compares both sets with information about the percentage of represented cases
+* A selectable [survival plot](Exploration.md#survival-analysis) that compares both sets with information about the percentage of represented cases.
[![Top Cohort](images/GDC-Cohort-Comparison-Top.png)](images/GDC-Cohort-Comparison-Top.png "Click to see the full image.")
diff --git a/docs/Data_Portal/Users_Guide/Exploration.md b/docs/Data_Portal/Users_Guide/Exploration.md
index 4d571c458..09aa1a5ce 100644
--- a/docs/Data_Portal/Users_Guide/Exploration.md
+++ b/docs/Data_Portal/Users_Guide/Exploration.md
@@ -1,6 +1,6 @@
# Exploration
-The Exploration page allows users to explore data in the GDC using advanced filters/facets, which includes those on a gene and mutation level. Users choose filters on specific `Cases`, `Genes`, and/or `Mutations` on the left of this page and then can visualize these results on the right. The Gene/Mutation data for these visualizations comes from the Open-Access MAF files on the GDC Portal.
+The Exploration Page allows users to explore data in the GDC using advanced filters/facets, including those at the gene and mutation level. Users choose filters for specific `Cases`, `Genes`, and/or `Mutations` on the left of this page and can then visualize the results on the right. The Gene/Mutation data for these visualizations comes from the Open-Access MAF files on the GDC Data Portal.
[![Exploration Page](images/GDC-Exploration-Page_v5.png)](images/GDC-Exploration-Page_v4.png "Click to see the full image.")
@@ -15,34 +15,19 @@ The first tab of filters is for cases in the GDC.
These criteria limit the results only to specific cases within the GDC. The default filters available are:
-* __Case__: Specify individual cases using submitter ID (barcode), UUID, or list of Cases ('Case Set')
-* __Case Submitter ID__: Search for cases using a part (prefix) of the submitter ID (barcode).
-* __Primary Site__: Anatomical site of the cancer under investigation or review.
-* __Program__: A cancer research program, typically consisting of multiple focused projects.
-* __Project__: A cancer research project, typically part of a larger cancer research program.
-* __Disease Type__: Type of cancer studied.
-* __Gender__: Gender of the patient.
-* __Age at Diagnosis__: Patient age at the time of diagnosis.
-* __Vital Status__: Indicator of whether the patient was living or deceased at the date of last contact.
-* __Days to Death__: Number of days from date of diagnosis to death of the patient.
-* __Race__: Race of the patient.
-* __Ethnicity__: Ethnicity of the patient.
+* __Case:__ Specify individual cases using submitter ID (barcode), UUID, or list of Cases ('Case Set').
+* __Primary Site:__ Anatomical site of the cancer under investigation or review.
+* __Program:__ A cancer research program, typically consisting of multiple focused projects.
+* __Project:__ A cancer research project, typically part of a larger cancer research program.
+* __Disease Type:__ Type of cancer studied.
+* __Gender:__ Gender of the patient.
+* __Age at Diagnosis:__ Patient age at the time of diagnosis.
+* __Vital Status:__ Indicator of whether the patient was living or deceased at the date of last contact.
+* __Days to Death:__ Number of days from date of diagnosis to death of the patient.
+* __Race:__ Race of the patient.
+* __Ethnicity:__ Ethnicity of the patient.
-In addition to the defaults, users can add additional case filters by clicking on the link titled 'Add a Case Filter'
-
-#### Upload Case Set
-
-In the `Cases` filters panel, instead of supplying cases one-by-one, users can supply a list of cases. Clicking on the `Upload Case Set` button will launch a dialog as shown below, where users can supply a list of cases or upload a comma-separated text file of cases.
-
-[![Upload Case Set](images/gdc-exploration-case-set.png)](images/gdc-exploration-case-set.png "Click to see the full image.")
-
-After supplying a list of cases, a table below will appear which indicates whether the case was found.
-
-[![Upload Case Set Validation](images/gdc-exploration-case-set-validation.png)](images/gdc-exploration-case-set-validation.png "Click to see the full image.")
-
-Clicking on `Submit` will filter the results in the Exploration Page by those cases.
-
-[![Upload Case Set Results](images/case-set-filter_v3.png)](images/case-set-filter_v2.png "Click to see the full image.")
+In addition to the defaults, users can add more case filters by clicking on the link titled ["Add a Case Filter"](Repository.md#adding-custom-facets).
### Gene Filters
@@ -52,25 +37,13 @@ The second tab of filters is for genes affected by mutations in the GDC.
The second tab of filters are for specific genes. Users can filter by:
-* __Gene__ - Entering in a specific Gene Symbol, ID, or list of Genes ('Gene Set')
-* __Biotype__ - Classification of the type of gene according to Ensembl. The biotypes can be grouped into protein coding, pseudogene, long noncoding and short noncoding. Examples of biotypes in each group are as follows:
- * __Protein coding__: IGC gene, IGD gene, IG gene, IGJ gene, IGLV gene, IGM gene, IGV gene, IGZ gene, nonsense mediated decay, nontranslating CDS, non stop decay, polymorphic pseudogene, TRC gene, TRD gene, TRJ gene.
- * __Pseudogene__: disrupted domain, IGC pseudogene, IGJ pseudogene, IG pseudogene, IGV pseudogene, processed pseudogene, transcribed processed pseudogene, transcribed unitary pseudogene, transcribed unprocessed pseudogene, translated processed pseudogene, TRJ pseudogene, unprocessed pseudogene
- * __Long noncoding__: 3prime overlapping ncrna, ambiguous orf, antisense, antisense RNA, lincRNA, ncrna host, processed transcript, sense intronic, sense overlapping
- * __Short noncoding__: miRNA, miRNA_pseudogene, miscRNA, miscRNA pseudogene, Mt rRNA, Mt tRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, tRNA, tRNA_pseudogene
-* __Is Cancer Gene Census__ - Whether or not a gene is part of [The Cancer Gene Census](http://cancer.sanger.ac.uk/census/)
-
-#### Upload Gene Set
-
-In the `Genes` filters panel, instead of supplying genes one-by-one, users can supply a list of genes. Clicking on the `Upload Gene Set` button will launch a dialog as shown below, where users can supply a list of genes or upload a comma-separated text file of genes.
-
-[![Upload Gene Set](images/Exploration-Upload-Gene-Set.png)](images/Exploration-Upload-Gene-Set.png "Click to see the full image.")
-
-After supplying a list of genes, a table below will appear which indicates whether the gene was found.
-
-[![Upload Gene Set Validation](images/Exploration-Upload-Gene-Set-Validation.png)](images/Exploration-Upload-Gene-Set-Validation.png "Click to see the full image.")
-
-Clicking on `Submit` will filter the results in the Exploration Page by those genes.
+* __Gene:__ Specify a Gene Symbol, ID, or list of Genes ('Gene Set').
+* __Biotype:__ Classification of the type of gene according to Ensembl. The biotypes can be grouped into protein coding, pseudogene, long noncoding and short noncoding. Examples of biotypes in each group are as follows:
+ * __Protein coding:__ IGC gene, IGD gene, IG gene, IGJ gene, IGLV gene, IGM gene, IGV gene, IGZ gene, nonsense mediated decay, nontranslating CDS, non stop decay, polymorphic pseudogene, TRC gene, TRD gene, TRJ gene, TRV gene.
+ * __Pseudogene:__ disrupted domain, IGC pseudogene, IGJ pseudogene, IG pseudogene, IGV pseudogene, processed pseudogene, transcribed processed pseudogene, transcribed unitary pseudogene, transcribed unprocessed pseudogene, translated processed pseudogene, translated unprocessed pseudogene, TRJ pseudogene, TRV pseudogene, unprocessed pseudogene.
+ * __Long noncoding:__ 3 prime overlapping ncrna, ambiguous orf, antisense, antisense RNA, lincRNA, macro lincRNA, ncrna host, processed transcript, sense intronic, sense overlapping.
+ * __Short noncoding:__ miRNA, miRNA pseudogene, miscRNA, miscRNA pseudogene, Mt rRNA, Mt tRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, tRNA, tRNA pseudogene, vaultRNA.
+* __Is Cancer Gene Census:__ Whether or not a gene is part of [The Cancer Gene Census](http://cancer.sanger.ac.uk/census/).
### Mutation Filters
@@ -80,169 +53,329 @@ The final tab of filters is for specific mutations.
Users can filter by:
-* __Mutation__ - Unique ID for that mutation. Users can use the following:
+* __Mutation:__ Unique ID for that mutation. Users can use the following:
* UUID - c7c0aeaa-29ed-5a30-a9b6-395ba4133c63
* DNA Change - chr12:g.121804752delC
* COSMIC ID - COSM202522
- * List of any mutation UUIDs or DNA Change id's ('Mutation Set')
-* __Consequence Type__ - Consequence type of this variation; [sequence ontology](http://www.sequenceontology.org/) terms
-* __Impact__ - A subjective classification of the severity of the variant consequence. This information comes from the [Ensembl VEP](http://www.ensembl.org/info/genome/variation/predicted_data.html).
-* __Type__ - A general classification of the mutation
-* __Variant Caller__ - The variant caller used to identify the mutation
-* __COSMIC ID__ - The identifier of the gene or mutation maintained in COSMIC, the Catalogue Of Somatic Mutations In Cancer
-* __dbSNP rs ID__ - The reference SNP identifier maintained in dbSNP
+ * List of any mutation UUIDs or DNA Change IDs ('Mutation Set').
+* __Consequence Type:__ Consequence type of this variation; [sequence ontology](http://www.sequenceontology.org/) terms.
+* __Impact:__ A subjective classification of the severity of the variant consequence. These scores are determined using the following three tools:
+ * __[Ensembl VEP](http://useast.ensembl.org/info/genome/variation/prediction/index.html):__
+ * __HIGH (H):__ The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay.
+ * __MODERATE (M):__ A non-disruptive variant that might change protein effectiveness.
+ * __LOW (L):__ Assumed to be mostly harmless or unlikely to change protein behavior.
+ * __MODIFIER (MO):__ Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact.
+ * __[PolyPhen](http://genetics.bwh.harvard.edu/pph/):__
+ * __probably damaging (PR):__ Predicted with high confidence to affect protein function or structure.
+ * __possibly damaging (PO):__ Predicted to affect protein function or structure, but with lower confidence.
+ * __benign (BE):__ Most likely lacking any phenotypic effect.
+ * __unknown (UN):__ In rare cases, a lack of data prevents PolyPhen from making a prediction.
+ * __[SIFT](http://sift.jcvi.org/):__
+ * __tolerated:__ Not likely to have a phenotypic effect.
+ * __tolerated_low_confidence:__ More likely to have a phenotypic effect than 'tolerated'.
+ * __deleterious:__ Likely to have a phenotypic effect.
+ * __deleterious_low_confidence:__ Less likely to have a phenotypic effect than 'deleterious'.
+* __Type:__ A general classification of the mutation.
+* __Variant Caller:__ The variant caller used to identify the mutation.
+* __COSMIC ID:__ This option restricts the results to mutations that have a COSMIC ID.
+* __dbSNP rs ID:__ This option restricts the results to mutations that have a SNP identifier maintained in dbSNP.
+
+## Results
-#### Upload Mutation Set
+As users add filters to the data on the Exploration Page, the Results section will automatically be updated. Results are divided into different tabs: `Cases`, `Genes`, `Mutations`, and `OncoGrid`.
-In the `Mutations` filters panel, instead of supplying mutation id's one-by-one, users can supply a list of mutations. Clicking on the `Upload Mutation Set` button will launch a dialog as shown below, where users can supply a list of mutations or upload a comma-separated text file of mutations.
+To illustrate these tabs, Case, Gene, and Mutation filters have been applied (genes in the Cancer Gene Census that have a missense variant in the TCGA-BRCA project). A description of what each tab displays follows.
-[![Upload Case Set](images/gdc-exploration-mutation-set.png)](images/gdc-exploration-case-set.png "Click to see the full image.")
+### Cases
-After supplying a list of mutations, a table below will appear which indicates whether the mutation was found.
+The `Cases` tab gives an overview of all the cases/patients who correspond to the filters chosen (Cohort).
-[![Upload Case Set Validation](images/gdc-exploration-mutation-set-validation.png)](images/gdc-exploration-case-set-validation.png "Click to see the full image.")
+[![Exploration Case Example](images/Exploration-Case-Example_v3.png)](images/Exploration-Case-Example_v2.png "Click to see the full image.")
-Clicking on `Submit` will filter the results in the Exploration Page by those mutations.
+The top of this section contains a few pie graphs with categorical information regarding the Primary Site, Project, Disease Type, Gender, and Vital Status.
-[![Upload Case Set Results](images/mutation-set-filter.png)](images/case-set-filter.png "Click to see the full image.")
+Below these pie charts is a tabular view of cases, which can be exported, sorted and saved using the buttons on the right and includes the following information:
-## Results
+* __Case ID (Submitter ID):__ The Case ID / submitter ID of that case/patient (e.g. TCGA Barcode).
+* __Project:__ The study name for the project to which the case belongs.
+* __Primary Site:__ The primary site of the cancer/project.
+* __Gender:__ The gender of the case.
+* __Files:__ The total number of files available for that case.
+* __Available Files per Data Category:__ Seven columns displaying the number of files available in each of the seven data categories. These link to the files for the specific case.
+* __# Mutations:__ The number of SSMs (simple somatic mutations) detected in that case.
+* __# Genes:__ The number of genes affected by mutations in that case.
+* __Slides:__ The total number of slides available for that case. For more information, see [slide images](Repository.md#image-viewer-features).
-As users add filters to the data on the Exploration Page, the Results section will automatically be updated. Results are divided into different tabs: `Cases`, `Genes`, `Mutations`, and `OncoGrid`.
+>__Note:__ By default, the UUID is not displayed on summary page tables. You can display the UUID by clicking on the icon with 3 parallel lines and checking the UUID option.
-To illustrate these tabs, Case, Gene, and Mutation filters have been chosen ( Genes in the Cancer Gene Census, that have HIGH VEP Impact for the TCGA-BRCA project) and a description of what each tab displays follows.
+### Case Summary Page
+The Case Summary Page displays case details including the project and disease information, data files that are available for that case, and the experimental strategies employed. A button in the top-right corner of the page allows the user to add all files associated with the case to the file [cart](Cart.md).
-#### Cases
+[![Case Page](images/gdc-case-entity-page.png)](images/gdc-case-entity-page.png "Click to see the full image.")
-The `Cases` tab gives an overview of all the cases/patients who correspond to the filters chosen (Cohort).
+#### Clinical and Biospecimen Information
-[![Exploration Case Example](images/Exploration-Case-Example_v3.png)](images/Exploration-Case-Example_v2.png "Click to see the full image.")
+The page also provides clinical and biospecimen information about that case. Links to export clinical and biospecimen information in JSON format are provided.
-The top of this section contains a few pie graphs with categorical information regarding the Primary Site, Project, Disease Type, Gender, and Vital Status.
+[![Case Page, Clinical and Biospecimen](images/gdc-case-clinical-biospecimen_v3.png)](images/gdc-case-clinical-biospecimen_v3.png "Click to see the full image.")
-Below these pie charts is a tabular view of cases (which can be exported, sorted and saved using the buttons on the right), that includes the following information:
+For clinical record types that can have multiple records for the same case (Diagnoses, Family Histories, or Exposures), the UUID of each record is provided at the top of the corresponding tab.
-* __Case ID (Submitter ID):__ The Case ID / submitter ID of that case/patient (i.e. TCGA Barcode)
-* __Project:__ The study name for the project for which the case belongs
-* __Primary Site:__ The primary site of the cancer/project
-* __Gender:__ The gender of the case
-* __Files:__ The total number of files available for that case
-* __Available Files per Data Category:__ Five columns displaying the number of files available in each of the five data categories. These link to the files for the specific case.
-* __# Mutations:__ The number of SSMs (simple somatic mutations) detected in that case
-* __# Genes:__ The number of genes affected by mutations in that case
-* __Slides:__ The total number of slides available for that case.
+#### Biospecimen Search
-*Note: By default, the Case UUID is not displayed. You can display the UUID of the case, but clicking on the icon with 3 parallel lines, and choose to display the Case UUID*
+A search filter just below the biospecimen section can be used to find and filter biospecimen data. The wildcard search will highlight entities in the tree that match the characters typed. The search covers both the case submitter ID and the additional metadata for each entity. For example, searching 'Primary Tumor' will highlight samples that match that type.
-#### Genes
+[![Biospecimen Search](images/gdc_case_biospecimen_search_v3.png)](images/gdc_case_biospecimen_search_v3.png "Click to see the full image.")
+
+#### Most Frequent Somatic Mutations for a Case
+
+The Case Summary Page also lists the mutations found in that particular case.
+
+[![Case Page](images/gdc-case-entity-mfm.png)](images/gdc-case-entity-mfm.png "Click to see the full image.")
+
+For more information, please go to the [Most Frequent Somatic Mutations](#most-frequent-somatic-mutations) section.
+
+### Genes
The `Genes` tab will give an overview of all the genes that match the criteria of the filters (Cohort).
-[![Exploration Gene Example](images/Exploration-Gene-Example.png)](images/Exploration-Gene-Example.png "Click to see the full image.")
+[![Exploration Gene Example](images/Exploration-Gene-Example2.png)](images/Exploration-Gene-Example2.png "Click to see the full image.")
-The top of this section contains a survival plot of all the cases within the specified Exploration page search, in addition to a bar graph of the most frequently mutated genes. Hovering over each bar in the plot will display information about the percentage of cases affected. Users may choose to download the underlying data in JSON or TSV format or an image of the graph in SVG or PNG format by clicking the `download` icon at the top of each graph.
+The top of this tab contains a bar graph of the most frequently mutated genes. Hovering over each bar in the plot will display information about the percentage of cases affected. In addition, this section contains a survival curve. The survival curve is calculated using the Kaplan-Meier estimator based on all cases with survival data within the specified Exploration Page search. For more information on how these values are determined, please go to the [Survival Analysis](#survival-analysis) section. Users may choose to download the underlying data in JSON or TSV format or an image of the graph in SVG or PNG format by clicking the `download` icon at the top of each graph.
Below these graphs is a tabular view of the genes affected, which includes the following information:
-* __Symbol:__ The gene symbol, which links to the Gene Summary Page
-* __Name:__ Full name of the gene
-* __Cytoband:__ The location of the mutation on the chromosome in terms of Giemsa-stained samples.
-* __Type:__ The type of gene
-* __# Affected Cases in Cohort:__ The number of cases affected in the Cohort
-* __# Affected Cases Across all Projects:__ The number of cases within all the projects in the GDC that contain a mutation on this gene. Clicking the red arrow will display the cases broken down by project
-* __# Mutations:__ The number of SSMs (simple somatic mutations) detected in that gene
-* __Annotations:__ Includes a COSMIC symbol if the gene belongs to [The Cancer Gene Census](http://cancer.sanger.ac.uk/census/)
-* __Survival Analysis:__ An icon that, when clicked, will plot the survival rate between cases in the project with mutated and non-mutated forms of the gene
+* __Symbol:__ The gene symbol, which links to the Gene Summary Page.
+* __Name:__ Full name of the gene.
+* __# SSM Affected Cases in Cohort:__ The number of cases affected by SSMs (simple somatic mutations) in the Cohort.
+* __# SSM Affected Cases Across the GDC:__ The number of cases within all the projects in the GDC that contain a mutation on this gene. Clicking the red arrow will display the cases broken down by project.
+* __# CNV Gain:__ The number of CNV (copy number variation) events detected in that gene which resulted in an increase (gain) in the gene's copy number.
+* __# CNV Loss:__ The number of CNV events detected in that gene which resulted in a decrease (loss) in the gene's copy number.
+* __# Mutations:__ The number of SSMs (simple somatic mutations) detected in that gene.
+* __Annotations:__ Includes a COSMIC symbol if the gene belongs to [The Cancer Gene Census](http://cancer.sanger.ac.uk/census/).
+* __Survival:__ An icon that, when clicked, will plot the survival rate between cases in the project with mutated and non-mutated forms of the gene.
-#### Survival Analysis
+### Gene Summary Page
-Survival analysis is used to analyze the occurrence of event data over time. In the GDC, survival analysis is performed on the mortality of the cases. Survival analysis requires:
+Gene Summary Pages describe each gene with mutation data and provide results related to the analyses that are performed on these genes.
-* Data on the time to a particular event (days to death or last follow up)
- * Fields: __diagnoses.days_to_death__ and __diagnoses.days_to_last_follow_up__
-* Information on whether the event has occurred (alive/deceased)
- * Fields: __diagnoses.vital_status__
-* Data split into different categories or groups (i.e. gender, etc.)
- * Fields: __demographic.gender__
+The summary section of the Gene Page contains the following information:
-The survival analysis in the GDC uses a Kaplan-Meier estimator:
+[![Gene Summary](images/GDC-Gene-Summary.png)](images/GDC-Gene-Summary.png "Click to see the full image.")
-[![Kaplan-Meier Estimator](images/gdc-kaplan-meier-estimator.png)](images/gdc-kaplan-meier-estimator "Click to see the full image.")
+* __Symbol:__ The gene symbol.
+* __Name:__ Full name of the gene.
+* __Synonyms:__ Synonyms of the gene name or symbol, if available.
+* __Type:__ A broad classification of the gene.
+* __Location:__ The chromosome on which the gene is located and its coordinates.
+* __Strand:__ Whether the gene is located on the forward (+) or reverse (-) strand.
+* __Description:__ A description of gene function and downstream consequences of gene alteration.
+* __Annotation:__ A notation/link that states whether the gene is part of [The Cancer Gene Census](http://cancer.sanger.ac.uk/census/).
-Where:
+#### External References
- * S(ti) is the estimated survival probability for any particular one of the t time periods
- * ni is the number of subjects at risk at the beginning of time period ti
- * and di is the number of subjects who die during time period ti
+A list with links that lead to external databases with additional information about each gene is displayed here. These external databases include:
-The table below is an example data set to calculate survival for a set of seven cases:
+* [Entrez](https://www.ncbi.nlm.nih.gov/gquery/)
+* [Uniprot](http://www.uniprot.org/)
+* [Hugo Gene Nomenclature Committee](http://www.genenames.org/)
+* [Online Mendelian Inheritance in Man](https://www.omim.org/)
+* [Ensembl](http://may2015.archive.ensembl.org/index.html)
-[![Sample Survival Analysis Table](images/gdc-sample-survival-table.png)](images/gdc-sample-survival-table.png "Click to see the full image.")
+#### Cancer Distribution
-The calculated cumulated survival probability can be plotted against the interval to obtain a survival plot like the one shown below.
+A table and two bar graphs show how many cases are affected by mutations and copy number variation within the gene as a ratio and percentage. Each row/bar represents the number of cases for each project. The final column in the table lists the number of unique mutations observed on the gene for each project.
-[![Sample Survival Analysis Plot](images/gdc-survival-plot.png)](images/gdc-survival-plot.png "Click to see the full image.")
+[![Cancer Distribution](images/GDC-Gene-CancerDist.png)](images/GDC-Gene-CancerDist.png "Click to see the full image.")
+
+#### Protein Viewer
+
+Mutations and their frequency across cases are mapped to a graphical visualization of protein-coding regions with a lollipop plot. Pfam domains are highlighted along the x-axis to assign functionality to specific protein-coding regions. The bottom track represents a view of the full gene length. Different transcripts can be selected by using the drop-down menu above the plot.
+
+[![Protein Plot](images/GDC-Gene-ProteinGraph.png)](images/GDC-Gene-ProteinGraph.png "Click to see the full image.")
+
+The panel to the right of the plot allows the plot to be filtered by mutation consequences or impact. The plot will dynamically change as filters are applied. Mutation consequence and impact are denoted in the plot by color.
+
+>__Note__: The impact filter on this panel will not display the annotations for alternate transcripts.
+
+The plot can be viewed at different zoom levels by clicking and dragging across the x-axis, clicking and dragging across the bottom track, or double clicking the Pfam domain IDs. The `Reset` button can be used to bring the zoom level back to its original position. The plot can also be exported as a PNG image, SVG image, or JSON-formatted text by choosing the `Download` button above the plot.
+
+#### Most Frequent Somatic Mutations
+
+The 20 most frequent mutations in the gene are displayed as a bar graph that indicates the number of cases that share each mutation.
+
+[![Gene MFM](images/GDC-Gene-MFM.png)](images/GDC-Gene-MFM.png "Click to see the full image.")
+
+A table is displayed below that lists information about each mutation including:
-#### Mutations
+* __DNA Change:__ The chromosome and starting coordinates of the mutation are displayed along with the nucleotide differences between the reference and tumor allele.
+* __Type:__ A general classification of the mutation.
+* __Consequences:__ The effects the mutation has on the gene coding for a protein (e.g. synonymous, missense, non-coding transcript).
+* __# Affected Cases in Gene:__ The number of affected cases, expressed as a number across all mutations within the selected Gene.
+* __# Affected Cases Across GDC:__ The number of affected cases, expressed as a number across all projects. Clicking the arrow next to the percentage will expand the selection with a breakdown of each affected project.
+* __Impact:__ A [subjective classification](#mutation-filters) of the severity of the variant consequence. This is determined by three different tools:
+ * __[Ensembl VEP](http://useast.ensembl.org/info/genome/variation/prediction/index.html)__
+ * __[PolyPhen](http://genetics.bwh.harvard.edu/pph/)__
+ * __[SIFT](http://sift.jcvi.org/)__
-The `Mutations` tab will give an overview of all the mutations who match the criteria of the filters (Cohort).
+
+Clicking the `Open in Exploration` button will navigate the user to the Exploration Page, showing the same results in the table (mutations filtered by the gene).
+
+### Mutations
+
+The `Mutations` tab will give an overview of all the mutations that match the criteria of the filters (Cohort).
+
+Open-access mutation data is displayed by default. To access controlled-access mutations, users must apply to the correct data access authority, be granted access, and log in to the portal. If a user is logged in and has been granted access to controlled-access mutations, they will be integrated with open-access mutations throughout the portal visualizations and counts.
[![Exploration Mutation Example](images/Exploration-Mutation-Example.png)](images/Exploration-Mutation-Example.png "Click to see the full image.")
-At the top of this tab is a survival plot of all the cases within the specified exploration page filters.
+The top of this tab contains a survival curve. The survival curve is calculated using the Kaplan-Meier estimator based on all cases with survival data within the specified Exploration Page search. For more information on how these values are determined, please go to the [Survival Analysis](#survival-analysis) section. Users may choose to download the underlying data in JSON or TSV format or an image of the graph in SVG or PNG format by clicking the `download` icon at the top of the graph.
A table is displayed below that lists information about each mutation:
-* __DNA Change:__ The chromosome and starting coordinates of the mutation are displayed along with the nucleotide differences between the reference and tumor allele
-* __Type:__ A general classification of the mutation
-* __Consequences:__ The effects the mutation has on the gene coding for a protein (i.e. synonymous, missense, non-coding transcript). A link to the Gene Summary Page for the gene affected by the mutation is included
-* __# Affected Cases in Cohort:__ The number of affected cases in the Cohort as a fraction and as a percentage
-* __# Affected Cases in Across all Projects:__ The number of affected cases, expressed as number across all projects. This information comes from the [Ensembl VEP](http://www.ensembl.org/info/genome/variation/predicted_data.html). Choosing the arrow next to the percentage will display a breakdown of each affected project
-* __Impact (VEP):__ A subjective classification of the severity of the variant consequence. The categories are:
- * __HIGH (H)__: The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function, or triggering nonsense mediated decay
- * __MODERATE (M)__: A non-disruptive variant that might change protein effectiveness
- * __LOW (L)__: Assumed to be mostly harmless or unlikely to change protein behavior
- * __MODIFIER (MO)__: Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact
-* __Survival Analysis:__ An icon that when clicked, will plot the survival rate between the gene's mutated and non-mutated cases
+* __DNA Change:__ The chromosome and starting coordinates of the mutation are displayed along with the nucleotide differences between the reference and tumor allele.
+* __Type:__ A general classification of the mutation.
+* __Consequences:__ The effects the mutation has on the gene coding for a protein (e.g. synonymous, missense, non-coding transcript). A link to the [Gene Summary Page](Exploration.md#gene-summary-page) for the gene affected by the mutation is included.
+* __# Affected Cases in Cohort:__ The number of affected cases in the Cohort as a fraction and as a percentage.
+* __# Affected Cases in Across all Projects:__ The number of affected cases, expressed as a number across all projects. Clicking the arrow next to the percentage will display a breakdown of each affected project.
+* __Impact:__ A [subjective classification](#mutation-filters) of the severity of the variant consequence. This is determined by three different tools:
+ * __[Ensembl VEP](http://useast.ensembl.org/info/genome/variation/prediction/index.html)__
+ * __[PolyPhen](http://genetics.bwh.harvard.edu/pph/)__
+ * __[SIFT](http://sift.jcvi.org/)__
+* __Survival:__ An icon that, when clicked, will plot the survival rate between the gene's mutated and non-mutated cases.
+
+### Mutation Summary Page
+
+ The Mutation Summary Page contains information about one somatic mutation and how it affects the associated gene. Each mutation is identified by its chromosomal position and nucleotide-level change.
+
+ [![Mutation Summary](images/GDC-Mutation-Summary.png)](images/GDC-Mutation-Summary.png "Click to see the full image.")
+
+ - __UUID:__ A unique identifier (UUID) for this mutation.
+ - __DNA Change:__ Denotes the chromosome number, position, and nucleotide change of the mutation.
+ - __Type:__ A broad categorization of the mutation.
+ - __Reference Genome Assembly:__ The reference genome assembly to which the chromosomal position refers.
+ - __Allele in the Reference Assembly:__ The nucleotide(s) that compose the site in the reference assembly.
+ - __Functional Impact:__ A subjective classification of the severity of the variant consequence.
+
+#### External References
+
+ A separate panel contains links to databases that contain information about the specific mutation. These include [dbSNP](https://www.ncbi.nlm.nih.gov/projects/SNP/) and [COSMIC](http://cancer.sanger.ac.uk/cosmic).
+
+#### Consequences
+
+The consequences of the mutation are displayed in a table. The consequence terms are defined by the [Sequence Ontology](http://www.sequenceontology.org).
-*Note: By default, the Mutation UUID is not displayed. You can display the UUID of the case, but clicking on the icon with 3 parallel lines, and choose to display the Mutation UUID*
+ [![Mutation Consequences](images/GDC-Mutation-Consequences.png)](images/GDC-Mutation-Consequences.png "Click to see the full image.")
-#### OncoGrid
+The fields that describe each consequence are listed below:
-The Exploration page includes an OncoGrid plot of the cases with the most mutations, for the top 50 mutated genes affected by high impact mutations. Genes displayed on the left of the grid (Y-axis) correspond to individual cases on the bottom of the grid (X-axis).
+ * __Gene:__ The symbol for the affected gene.
+ * __AA Change:__ Details on the amino acid change, including compounds and position, if applicable.
+ * __Consequence:__ The biological consequence of each mutation.
+ * __Coding DNA Change:__ The specific nucleotide change and position of the mutation within the gene.
+* __Impact:__ A [subjective classification](#mutation-filters) of the severity of the variant consequence. This is determined by three different tools:
+ * __[Ensembl VEP](http://useast.ensembl.org/info/genome/variation/prediction/index.html)__
+ * __[PolyPhen](http://genetics.bwh.harvard.edu/pph/)__
+ * __[SIFT](http://sift.jcvi.org/)__
+ * __Strand:__ Whether the gene is located on the forward (+) or reverse (-) strand.
+ * __Transcript(s):__ The transcript(s) affected by the mutation. Each contains a link to the [Ensembl](https://www.ensembl.org) entry for the transcript.
-[![Exploration Oncogrid Example](images/Exploration-Oncogrid-Example.png)](images/Exploration-Oncogrid-Example.png "Click to see the full image.")
+#### Cancer Distribution
-The grid is color-coded with a legend at the top left which describes what type of mutation consequence is observed for each gene/case combination. Clinical information and the available data for each case are available at the bottom of the grid.
+A table and bar graph show how many cases are affected by the particular mutation. Each row/bar represents the number of cases for each project.
+
+ [![Mutation Distribution](images/GDC-Mutation-CancerDist.png)](images/GDC-Mutation-CancerDist.png "Click to see the full image.")
+
+The table contains the following fields:
+
+ * __Project ID__: The ID for a specific project.
+ * __Disease Type__: The disease associated with the project.
+ * __Site__: The anatomical site affected by the disease.
+ * __# SSM Affected Cases__: The number of affected cases and total number of cases displayed as a fraction and percentage.
+
+#### Protein Viewer
+
+The protein viewer displays a plot representing the position of mutations along the polypeptide chain. The y-axis represents the number of cases that exhibit each mutation, whereas the x-axis represents the polypeptide chain sequence. [Pfam domains](http://pfam.xfam.org/) that were identified along the polypeptide chain are marked with colored rectangles labeled with Pfam IDs. See the [Gene Summary Page](#gene-summary-page) for additional details about the [protein viewer](#protein-viewer).
+
+ [![Mutation Protein Graph](images/GDC-Mutation-ProteinGraph.png)](images/GDC-Mutation-ProteinGraph.png "Click to see the full image.")
+
+## OncoGrid
+
+The Exploration Page includes an OncoGrid plot of the cases with the most mutations, for the top 50 mutated genes affected by high impact mutations. Genes displayed on the left of the grid (Y-axis) correspond to individual cases on the bottom of the grid (X-axis). The plot also indicates, in each cell, any CNV events detected for these top mutated cases and genes.
+
+[![Exploration Oncogrid Example](images/Exploration-Oncogrid-Example_v2.png)](images/Exploration-Oncogrid-Example_v2.png "Click to see the full image.")
+
+The grid is color-coded with a legend at the top which describes what type of mutation consequence and CNV event is observed for each gene/case combination. Clinical information and the available data for each case are available at the bottom of the grid.
The right side of the grid displays additional information about the genes:
* __Gene Sets:__ Describes whether a gene is part of [The Cancer Gene Census](http://cancer.sanger.ac.uk/census/). (The Cancer Gene Census is an ongoing effort to catalogue those genes for which mutations have been causally implicated in cancer)
-* __GDC:__ Identifies all cases in the GDC affected with a mutation in this gene
+* __# Cases Affected:__ Identifies all cases in the GDC affected with a mutation in this gene.
-#### OncoGrid Options
+### OncoGrid Options
-To facilitate readability and comparisons, drag-and-drop can be used to reorder the gene rows. Double clicking a row in the "# Cases Affected" bar at the right side of the graphic launches the respective Gene Summary Page page. Hovering over a cell will display information about the mutation such as its ID, affected case, and biological consequence. Clicking on the cell will bring the user to the respective Mutation Summary page.
+To facilitate readability and comparisons, drag-and-drop can be used to reorder the gene rows. Double clicking a row in the "# Cases Affected" bar at the right side of the graphic launches the respective Gene Summary Page. Hovering over a cell will display information about the mutation such as its ID, affected case, and biological consequence. Clicking on the cell will bring the user to the respective Mutation Summary Page.
A tool bar at the top right of the graphic allows the user to export the data as a JSON object, PNG image, or SVG image. Eight buttons are available in this toolbar:
-* __Download:__ Users can choose to export the contents either to a static image file (PNG or SVG format) or the underlying data in JSON format
-* __Reload Grid:__ Sets all OncoGrid rows, columns, and zoom levels back to their initial positions
-* __Cluster Data:__ Clusters the rows and columns to place mutated genes with the same cases and cases with the same mutated genes together
-* __Toggle Heatmap:__ The view can be toggled between cells representing mutation consequences or number of mutations in each gene
-* __Toggle Gridlines:__ Turn the gridlines on and off
-* __Toggle Crosshairs:__ Turns crosshairs on, so that users can zoom into specific sections of the OncoGrid
-* __Fullscreen:__ Turns Fullscreen mode on/off
+* __Customize Colors:__ Users can customize the colors that represent mutation consequence types and CNV gains/losses.
+* __Download:__ Users can choose to export the contents either to a static image file (PNG or SVG format) or the underlying data in JSON format.
+* __Reload Grid:__ Sets all OncoGrid rows, columns, and zoom levels back to their initial positions.
+* __Cluster Data:__ Clusters the rows and columns to place mutated genes with the same cases and cases with the same mutated genes together.
+* __Toggle Heatmap:__ The view can be toggled between cells representing mutation consequences or number of mutations in each gene.
+* __Toggle Gridlines:__ Turns the gridlines on and off.
+* __Toggle Crosshairs:__ Turns crosshairs on, so that users can zoom into specific sections of the OncoGrid.
+* __Fullscreen:__ Turns Fullscreen mode on/off.
+
+### OncoGrid Color Picker
-### File Navigation
+To customize the colors for mutation consequence types and CNV gains/losses, a user can click the color picker icon in the OncoGrid toolbar.
+
+* __Customize Colors:__ Opens a control where the user can pick their own colors or apply a suggested theme and save their changes.
+* __Reset to Default:__ Resets all colors to the defaults initially used by OncoGrid.
+
+[![Exploration Oncogrid Color Picker](images/Exploration-Oncogrid-Color-Picker.png)](images/Exploration-Oncogrid-Color-Picker.png "Click to see the full image.")
+
+## File Navigation
After utilizing the Exploration Page to narrow down a specific cohort, users can find the specific files that relate to this group by clicking on the `View Files in Repository` button as shown in the image below.
-[![Exploration File Navigation](images/Exploration-View-Files_v3.png)](images/Exploration-View-Files_v2.png "Click to see the full image.")
+[![Exploration File Navigation](images/Exploration-View-Files_v3.png)](images/Exploration-View-Files_v3.png "Click to see the full image.")
Clicking this button will navigate the users to the Repository Page, filtered by the cases within the cohort.
-[![Input Set Explanation](images/gdc-input-set_v2.png)](images/gdc-input-set.png "Click to see the full image.")
+[![Input Set Explanation](images/gdc-input-set_v2.png)](images/gdc-input-set_v2.png "Click to see the full image.")
+
+The filters chosen on the Exploration Page are displayed as an `input set` on the Repository Page. Additional filters may be added on top of this `input set`, but the original set cannot be modified; instead, a new `input set` must be created from the original data.
+
+---
+
+## Survival Analysis
+
+The survival analysis, which is seen in both the `Gene` and `Mutation` tabs, is used to analyze the occurrence of event data over time. In the GDC, survival analysis is performed on the mortality of the cases. The values are retrieved from [GDC Data Dictionary](../../../Data_Dictionary) properties, and a survival analysis requires the following fields:
+
+* Data on the time to a particular event (days to death or last follow up).
+ * Fields: __diagnoses.days_to_death__ or __diagnoses.days_to_last_follow_up__
+* Information on whether the event has occurred (alive/deceased).
+ * Fields: __diagnoses.vital_status__
+* Data split into different categories or groups (e.g. gender).
+ * Fields: __demographic.gender__
+
+The survival analysis in the GDC uses a Kaplan-Meier estimator:
-The filters chosen on the Exploration page are displayed as an `input set` on the Repository page. Additional filters may be added on top of this `input set`, but the original set cannot be modified and instead must be created from scratch again.
+[![Kaplan-Meier Estimator](images/gdc-kaplan-meier-estimator2.png)](images/gdc-kaplan-meier-estimator2.png "Click to see the full image.")
+
+Where:
+
+ * S(t) is the estimated probability of surviving beyond time t, computed over all time periods ti up to and including t.
+ * ni is the number of subjects at risk at the beginning of time period ti.
+ * di is the number of subjects who die during time period ti.
+
+The table below is an example data set to calculate survival for a set of seven cases:
+
+[![Sample Survival Analysis Table](images/gdc-sample-survival-table.png)](images/gdc-sample-survival-table.png "Click to see the full image.")
+
+The calculated cumulative survival probability can be plotted against the interval to obtain a survival plot like the one shown below.
+
+[![Sample Survival Analysis Plot](images/gdc-survival-plot.png)](images/gdc-survival-plot.png "Click to see the full image.")
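
The Kaplan-Meier calculation described in this section can be sketched in a few lines of Python. This is a minimal illustration of the standard product-limit form, S(t) = product over ti <= t of (1 - di/ni), and not the GDC's actual implementation. The function name `kaplan_meier` and the seven sample cases are hypothetical; the sample values are invented for illustration and are not taken from the example table above. Cases that are alive at last follow-up are treated as censored: they count toward ni while still under observation but never toward di.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t_i, S(t_i))] for each distinct time with at least one death.

    durations: days to death or to last follow-up, one value per case
    events:    True if the case is deceased, False if censored (alive)
    """
    # d_i: number of deaths observed at each distinct time t_i
    deaths = Counter(t for t, dead in zip(durations, events) if dead)

    curve = []
    survival = 1.0
    for t in sorted(deaths):
        # n_i: cases still under observation at the start of period t_i
        n_at_risk = sum(1 for d in durations if d >= t)
        survival *= 1 - deaths[t] / n_at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data for seven cases (not the values from the table image).
days = [100, 200, 200, 300, 400, 500, 600]
deceased = [True, False, True, True, False, True, False]

curve = kaplan_meier(days, deceased)
for t, s in curve:
    print(f"day {t}: S(t) = {s:.3f}")
```

Plotting the returned pairs as a step function gives a survival curve of the kind shown in the portal. Splitting the cases into groups (for example, mutated versus non-mutated) and calling the function once per group yields the curves that are compared in the survival plots.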
diff --git a/docs/Data_Portal/Users_Guide/Genes_and_Mutations.md b/docs/Data_Portal/Users_Guide/Genes_and_Mutations.md
index d4cd7d5ff..23e39f275 100644
--- a/docs/Data_Portal/Users_Guide/Genes_and_Mutations.md
+++ b/docs/Data_Portal/Users_Guide/Genes_and_Mutations.md
@@ -27,9 +27,9 @@ A list with links that lead to external databases with additional information ab
### Cancer Distribution
-A table and bar graph show how many cases are affected by mutations within the gene as a ratio and percentage. Each row/bar represents the number of cases for each project. The final column in the table lists the number of unique mutations observed on the gene for each project.
+A table and two bar graphs (one for mutations, one for CNV events) show how many cases are affected by mutations and CNV events within the gene as a ratio and percentage. Each row/bar represents the number of cases for each project. The final column in the table lists the number of unique mutations observed on the gene for each project.
-[![Cancer Distribution](images/GDC-Gene-CancerDist.png)](images/GDC-Gene-CancerDist.png "Click to see the full image.")
+[![Cancer Distribution](images/GDC-Gene-CancerDist_v2.png)](images/GDC-Gene-CancerDist_v2.png "Click to see the full image.")
### Protein Viewer
diff --git a/docs/Data_Portal/Users_Guide/Getting_Started.md b/docs/Data_Portal/Users_Guide/Getting_Started.md
index f427d97a4..a841a2ea7 100644
--- a/docs/Data_Portal/Users_Guide/Getting_Started.md
+++ b/docs/Data_Portal/Users_Guide/Getting_Started.md
@@ -1,16 +1,15 @@
# Getting Started
-
## The GDC Data Portal: An Overview
The Genomic Data Commons (GDC) Data Portal provides users with web-based access to data from cancer genomics studies. Key GDC Data Portal features include:
-* Open, granular access to information about all datasets available in the GDC
-* Advanced search and visualization-assisted filtering of data files
-* Data visualization tools to support the analysis and exploration of data (including on a gene and mutation level from Open-Access MAF files)
-* Cart for collecting data files of interest
-* Authentication using eRA Commons credentials for access to controlled data files
-* Secure data download directly from the cart or using the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool)
+* Open, granular access to information about all datasets available in the GDC.
+* Advanced search and visualization-assisted filtering of data files.
+* Data visualization tools to support the analysis and exploration of data (including on a gene and mutation level from Open-Access MAF files).
+* Cart for collecting data files of interest.
+* Authentication using eRA Commons credentials and authorization using dbGaP for access to controlled data files.
+* Secure data download directly from the cart or using the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool).
For more information about available datasets, see the [GDC Website](https://gdc.cancer.gov/about-data).
@@ -18,7 +17,7 @@ For more information about available datasets, see the [GDC Website](https://gdc
## Accessing the GDC Data Portal
-The GDC Data Portal is accessible using a web browser such as Chrome, Internet Explorer, and Firefox at the following URL:
+The GDC Data Portal is accessible using a web browser such as Chrome, Firefox, and Microsoft Edge at the following URL:
[https://portal.gdc.cancer.gov](https://portal.gdc.cancer.gov)
@@ -27,10 +26,7 @@ The front page displays a summary of all available datasets:
[![GDC Home Page](images/GDC-Home-Page.png)](images/GDC-Home-Page.png "Click to see the full image.")
-
-## Navigation
-
-### Views
+## Views
The GDC Data Portal provides five navigation options (*Views*) for browsing available harmonized datasets:
@@ -44,23 +40,21 @@ The GDC Data Portal provides five navigation options (*Views*) for browsing avai
* __Repository__: The Repository link directs users to the [Repository Page](Repository.md). Here users can see the data files available for download at the GDC and apply file/case filters to narrow down their search.
-* __Image Viewer__: The [Image viewer](Image_viewer.md) allows users to visualize tissue slide images.
-
* __Human Outline__: The home page displays a human anatomical outline that can be used to refine their search. Choosing an associated organ will direct the user to a listing of all projects associated with that primary site. For example, clicking on the human brain will show only cases and projects associated with brain cancer (TCGA-GBM and TCGA-LGG). The number of cases associated with each primary site is also displayed here and separated by project.
Each view provides a distinct representation of the same underlying set of GDC data and metadata. The GDC also provides access to certain unharmonized data files generated by GDC-hosted projects. These files and their associated metadata are not represented in the views above; instead they can be found in the [GDC Legacy Archive](Legacy_Archive.md).
The Projects, Exploration, Analysis and Repository pages can be accessed from the GDC Data Portal front page and from the toolbar (see below). The annotations view is accessible from Repository view. A link to the GDC Legacy Archive is available on the GDC Data Portal front page and in the GDC Apps menu (see below).
-### Toolbar
+## Toolbar
The toolbar available at the top of all pages in the GDC Data Portal provides convenient navigation links and access to authentication and quick search.
-The left portion of this toolbar provides access to the Home Page, __Projects Page__, __Exploration Page__, __Analysis Page__, and a link to __Repository Page__:
+The left portion of this toolbar provides access to the __Home Page__, __Projects Page__, __Exploration Page__, __Analysis Page__, and a link to __Repository Page__:
[![GDC Data Portal Toolbar (Left)](images/gdc-data-portal-top-menu-bar-left.png)](images/gdc-data-portal-top-menu-bar-left.png "Click to see the full image.")
-The right portion of this toolbar provides access to [quick search](#quick-search), [manage sets](#manage-sets), [authentication functions](Authentication.md), the [cart](Cart.md), and the GDC Apps menu:
+The right portion of this toolbar provides access to [quick search](#quick-search), [manage sets](#manage-sets), [authentication functions](Repository.md#authentication), the [cart](Cart.md), and the GDC Apps menu:
[![GDC Data Portal Toolbar (Right)](images/gdc-data-portal-top-menu-bar-right.png)](images/gdc-data-portal-top-menu-bar-right.png "Click to see the full image.")
@@ -68,40 +62,41 @@ The GDC Apps menu provides links to all resources provided by the GDC, including
[![GDC Apps](images/gdc-data-portal-gdc-apps.png)](images/gdc-data-portal-gdc-apps.png "Click to see the full image.")
-### Tables
+## Tables
Tabular listings are the primary method of representing available data in the GDC Data Portal. Tables are available in all views and in the file cart. Users can customize each table by specifying columns, size, and sorting.
-#### Table Sort
-The *sort table* button is available in the top right corner of each table. To sort by a column, place a checkmark next to it and select the preferred sort direction. If multiple columns columns are selected for sorting, data is sorted column-by-column in the order that columns appear in the sort menu: the topmost selected column becomes the primary sorting parameter; the selected column below it is used for secondary sort, etc.
+### Table Sort
+
+The sort button is available in the top right corner of each table. To sort by a column, place a checkmark next to it and select the preferred sort direction. If multiple columns are selected for sorting, data is sorted column-by-column in the order that the columns appear in the sort menu: the topmost selected column becomes the primary sorting parameter; the selected column below it is used for secondary sort, etc.
[![Sorting a table](images/gdc-data-portal-table-sort.png)](images/gdc-data-portal-table-sort.png "Click to see the full image.")
-#### Table Arrangement
+### Table Arrangement
-The *arrange columns* button allows users to adjust the order of columns in the table and select which columns are displayed.
+The arrange button allows users to adjust the order of columns in the table and select which columns are displayed.
![Selecting table columns](images/gdc-data-portal-table-column-selection.png)
-#### Table Size
+### Table Size
Table size can be adjusted using the menu in the bottom left corner of the table. The menu sets the maximum number of rows to display. If the number of entries to be displayed exceeds the maximum number of rows, then the table will be paginated, and navigation buttons will be provided in the bottom right corner of the table to navigate between pages.
![Specifying table size](images/gdc-data-portal-table-size-and-pagination.png)
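As a rough illustration of how the maximum row count leads to pagination, the following Python sketch splits a list of entries into pages. The `paginate` function and the sample numbers are assumptions made for this example, not part of the portal.

```python
import math


def paginate(entries, page_size, page_number):
    """Return (rows on the requested page, total number of pages).

    page_number starts at 1. page_size plays the role of the maximum
    number of rows selected in the table size menu.
    """
    total_pages = max(1, math.ceil(len(entries) / page_size))
    start = (page_number - 1) * page_size
    return entries[start:start + page_size], total_pages


entries = list(range(1, 46))  # 45 hypothetical table entries
page, total = paginate(entries, page_size=20, page_number=3)
print(total, page)
```

With 45 entries and a maximum of 20 rows, the table needs three pages, and the last page holds the remaining five entries.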
-#### Table Export
+### Table Export
In the Repository, Projects, and Annotations views, tables can be exported in either a JSON or TSV format. The `JSON` button will export the entire table's contents into a JSON file. The `TSV` button will export the current view of the table into a TSV file.
[![Table Columns Filtering](images/gdc-data-portal-table-export.png)](images/gdc-data-portal-table-export.png "Click to see the full image.")
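The difference between the two export formats can be sketched in Python. The example below is a hypothetical illustration, not the portal's export code: the row fields, the `visible_columns` list, and both helper functions are assumptions. It treats "the current view" as the set of columns currently displayed.

```python
import csv
import io
import json

# Hypothetical table contents; field names and values are illustrative only.
rows = [
    {"project_id": "P1", "case_count": 10, "file_count": 25},
    {"project_id": "P2", "case_count": 7, "file_count": 12},
]
visible_columns = ["project_id", "case_count"]  # columns shown in the current view


def to_json(rows):
    """Serialize every field of every row, like a full-table export."""
    return json.dumps(rows, indent=2)


def to_tsv(rows, columns):
    """Serialize only the given columns as tab-separated values."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",  # silently skip fields not in `columns`
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(rows)
tsv_text = to_tsv(rows, visible_columns)
print(tsv_text)
```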
-### Filtering and Searching
+## Filtering and Searching
The GDC Data Portal offers three different means of searching and filtering the available data: facet filters, quick search, and advanced search.
-#### Facet Filters
+### Facet Filters
Facets on the left of each view (Projects, Exploration, and Repository) represent properties of the data that can be used for filtering. Some of the available facets are project name, disease type, patient gender and age at diagnosis, and various data formats and categories. Each facet displays the name of the data property, the available values, and numbers of matching entities for each value (files, cases, mutations, genes, annotations, or projects, depending on the context).
@@ -113,9 +108,9 @@ Multiple selections within a facet are treated as an "OR" query: e.g. "Aligned R
The information displayed in each facet reflects this: in the example above, marking the "Aligned Reads" checkbox does not change the numbers or the available values in the _Data Type_ facet where the checkbox is found, but it does change the values available in the _Experimental Strategy_ facet. The _Experimental Strategy_ facet now displays only values from files of _Data Type_ "Aligned Reads".
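The facet behavior described above can be modeled with a small Python sketch. It assumes that selections within one facet are combined with OR, that selections in different facets are combined with AND, and that the counts shown inside a facet ignore that facet's own selections. The record fields and helper functions are hypothetical and are not the portal's implementation.

```python
from collections import Counter

# Hypothetical file records; field names are illustrative only.
files = [
    {"data_type": "Aligned Reads", "experimental_strategy": "WXS"},
    {"data_type": "Aligned Reads", "experimental_strategy": "RNA-Seq"},
    {"data_type": "Gene Expression Quantification", "experimental_strategy": "RNA-Seq"},
    {"data_type": "Masked Copy Number Segment", "experimental_strategy": "Genotyping Array"},
]


def matches(record, selections, skip_facet=None):
    """Return True if the record satisfies the selected facet values.

    selections maps a facet name to a set of selected values.
    Values within one facet are alternatives (OR); every facet with a
    selection must match (AND). skip_facet leaves one facet out.
    """
    return all(
        record.get(facet) in values
        for facet, values in selections.items()
        if values and facet != skip_facet
    )


def facet_counts(records, selections, facet):
    """Count values for one facet, ignoring that facet's own selections."""
    return Counter(
        record[facet]
        for record in records
        if matches(record, selections, skip_facet=facet)
    )


selections = {"data_type": {"Aligned Reads"}}
filtered = [record for record in files if matches(record, selections)]
strategy_counts = facet_counts(files, selections, "experimental_strategy")
type_counts = facet_counts(files, selections, "data_type")
print(len(filtered), dict(strategy_counts), dict(type_counts))
```

In this sketch, selecting "Aligned Reads" leaves all three values visible in the data type counts, while the experimental strategy counts shrink to the strategies found in aligned-reads files.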
-Custom facet filters can be added in [Repository View](Repository.md) to expand the GDC Data Portal's filtering capabilities.
+Custom facet filters can be added in the [Repository View](Repository.md) to expand the GDC Data Portal's filtering capabilities.
-#### Quick Search
+### Quick Search
The quick search feature allows users to find cases, files, mutations, or genes using a search query (i.e. UUID, filename, gene name, DNA Change, project name, id, disease type or primary site). Quick search is available by clicking on the magnifier in the right section of the toolbar (which appears on every page) or by using the search bar on the Home Page.
@@ -131,16 +126,16 @@ __Toolbar Quick Search:__
[![Quick Search, Searching for an Entity](images/gdc-quick-search2.png)](images/gdc-quick-search2.png "Click to see the full image.")
-#### Advanced Search
+### Advanced Search
Advanced Search is available in Repository View. It allows users to construct complex queries with a custom query language and auto-complete suggestions. See [Advanced Search](Advanced_Search.md) for details.
-#### Manage Sets
+## Manage Sets
The `Manage Sets` button at the top of the GDC Portal stores sets of cases, genes, or mutations of interest. On this page, users can review the sets that have been saved as well as upload new sets and delete existing sets.
[![Manage Sets](images/gdc-manage-sets.png)](images/gdc-manage-sets.png "Click to see the full image.")
-##### Upload Sets
+### Upload Sets
Clicking the `Upload Set` button shows options for creating Case, Gene, or Mutation sets.
@@ -154,7 +149,7 @@ Clicking the `Submit` button will add the set of items to the list of sets on th
[![New Sets Gene](images/gdc-new-set.png)](images/gdc-manage-sets.png "Click to see the full image.")
-##### Export Sets
+### Export Sets
Users can export selected sets on this page by first clicking the checkboxes next to each set, then clicking the `Export selected` button at the top of the table.
@@ -162,15 +157,15 @@ Users can export selected sets on this page by first clicking the checkboxes nex
A text file containing the UUID of each case, gene or mutation is downloaded after clicking this button.
-##### Review Sets
+### Review Sets
There are a few buttons in the list of sets that allows a user to get further information about each one.
* __# Items__: Clicking the link under the # Items column navigates the user to the Exploration page using the set as a filter.
-* __Download/View__: To the right of the # Items column are buttons that will download the list as a tsv or open the cases in the Repository page.
+* __Download/View__: To the right of the # Items column are buttons that will download the list as a TSV or open the cases in the Repository Page.
-##### Creating Sets from GDC Portal Filters
-Many pages on the GDC Portal have an option called `Save Sets` that allows users to save a group of cases, mutations, or genes for further analysis. After using the filtering options on the `Exploration` page as an example, users can click the `Save Case/Gene/Mutation Set` button to save this set.
+### Creating Sets from GDC Portal Filters
+Many pages on the GDC Portal have an option called `Save Sets` that allows users to save a group of cases, mutations, or genes for further analysis. For example, after applying filters on the `Exploration` Page, users can click the `Save Case/Gene/Mutation Set` button to save the resulting set.
-[![Save Sets](images/gdc-exploration-save-sets.png)](images/gdc-quick-search2.png "Click to see the full image.")
+[![Save Sets](images/gdc-exploration-save-sets.png)](images/gdc-quick-search2.png "Click to see the full image.")
\ No newline at end of file
diff --git a/docs/Data_Portal/Users_Guide/Legacy_Archive.md b/docs/Data_Portal/Users_Guide/Legacy_Archive.md
index ba84c08ae..616a1eac8 100644
--- a/docs/Data_Portal/Users_Guide/Legacy_Archive.md
+++ b/docs/Data_Portal/Users_Guide/Legacy_Archive.md
@@ -20,7 +20,7 @@ The GDC Legacy Archive contains a limited set of features of the GDC Data Portal
### File Page
-The file page of the GDC Legacy Archive is similar to the [file page of the GDC Data Portal](Repository.md#file-summary-page). It does not include the Workflow, Reference Genome, and Read Groups sections as these are only applicable to harmonized data available in the GDC Data Portal. The Legacy Archive includes additional archive information as described below.
+The file page of the GDC Legacy Archive is similar to the [File Summary Page of the GDC Data Portal](Repository.md#file-summary-page). It does not include the Workflow, Reference Genome, and Read Groups sections as these are only applicable to harmonized data available in the GDC Data Portal. The Legacy Archive includes additional archive information as described below.
[![Files Entity Page](images/gdc-data-portal-files-entity-page-Archive-MagTab.png)](images/gdc-data-portal-files-entity-page-Archive-MagTab.png "Click to see the full image.")
@@ -28,9 +28,9 @@ The file page of the GDC Legacy Archive is similar to the [file page of the GDC
If a file was originally produced as part of an archive containing other files, the archive information (Archive ID and number of files in the archive) is displayed in the file properties and, if selected, the user will see a list of files containing all other files in that archive.
-#### Metadata files
+#### Metadata Files
-If a file has any associated MAGE-TAB or SRA XML metadata files, these files will be listed at the bottom of the page. These files will can be downloaded directly from here. Alternatively, metadata files can be downloaded from the file cart.
+If a file has any associated MAGE-TAB or SRA XML metadata files, these files will be listed at the bottom of the page. These files can be downloaded directly from this page. Alternatively, metadata files can be downloaded from the file cart.
### File Cart
diff --git a/docs/Data_Portal/Users_Guide/Projects.md b/docs/Data_Portal/Users_Guide/Projects.md
index 9e8a9d28d..c1eb7ee72 100644
--- a/docs/Data_Portal/Users_Guide/Projects.md
+++ b/docs/Data_Portal/Users_Guide/Projects.md
@@ -1,34 +1,34 @@
# Projects
-## Summary
-At a high level, data in the Genomic Data Commons is organized by project. Typically, a project is a specific effort to look at particular type(s) of cancer undertaken as part of a larger cancer research program. The GDC Data Portal allows users to access aggregate project-level information via the Projects Page and Project Summary pages.
+At a high level, data in the Genomic Data Commons is organized by project. Typically, a project is a specific effort to look at particular type(s) of cancer undertaken as part of a larger cancer research program. The GDC Data Portal allows users to access aggregate project-level information via the Projects Page and Project Summary Pages.
## Projects Page
-The Projects Page provides an overview of all harmonized data available in the Genomic Data Commons, organized by project. It also provides filtering, navigation, and advanced visualization features that allow users to identify and browse projects of interest. Users can access Projects Page from the GDC Data Portal Home page, from the Data Portal toolbar, or directly at [https://portal.gdc.cancer.gov/projects](https://portal.gdc.cancer.gov/projects).
+The Projects Page provides an overview of all harmonized data available in the Genomic Data Commons, organized by project. It also provides filtering, navigation, and advanced visualization features that allow users to identify and browse projects of interest. Users can access the [Projects Page](https://portal.gdc.cancer.gov/projects) from the GDC Data Portal Home Page or from the Data Portal toolbar.
On the left, a panel of facets allow users to apply filters to find projects of interest. When facet filters are applied, the table and visualizations on the right are updated to display only the matching projects. When no filters are applied, all projects are displayed.
-The right side of this page displays a few visualizations of the data (Top Mutated Genes in Selected Projects and Case Distribution per Project). Below these graphs is a table that contains a list of projects and select details about each project, such as the number of cases and data files. The Graph tab provides a visual representation of this information.
+The right side of the Projects Page displays a few visualizations of the data (Top Mutated Genes in Selected Projects and Case Distribution per Project). Below these graphs is a table that contains a list of projects and select details about each project, such as the number of cases and data files. The Graph tab provides a visual representation of this information.
[![Projects Page, Main Window (Table View)](images/gdc-data-portal-project-page.png)](images/gdc-data-portal-project-page.png "Click to see the full image.")
### Visualizations
-[![Projects Visualizations)](images/gdc-projects-visualizations.png)](images/gdc-projects-visualizations.png "Click to see the full image.")
+[![Projects Visualizations](images/gdc_project_visualizations2.png)](images/gdc_project_visualizations2.png "Click to see the full image.")
#### Top Mutated Cancer Genes in Selected Projects
-This dynamically generated bar graph shows the 20 genes with the most mutations across all projects. The genes are filtered by those that are part of the Cancer Gene Census and that have the following types of mutations: `missense_variant, frameshift_variant, start_lost, stop_lost, initiator_codon_variant, and stop_gained`. The bars represent the frequency of each mutation and is broken down into different colored segments by project and disease type. The graphic is updated as filters are applied for projects, programs, disease types, and data categories available in the project. Note, that due the these filters the number of cases displayed here will be less that the total number of cases per project.
+This dynamically generated bar graph shows the 20 genes with the most mutations across all projects. The genes are filtered by those that are part of the Cancer Gene Census and that have the following types of mutations: `missense_variant`, `frameshift_variant`, `start_lost`, `stop_lost`, `initiator_codon_variant`, and `stop_gained`. The bars represent the frequency of mutations per gene and are broken down into different colored segments by project. The graphic is updated as filters are applied for projects, programs, disease types, and data categories available in the project.
-Hovering the cursor over each bar will display information about the number of cases affected by the disease type and clicking on each bar will launch the Gene Summary Page page for the gene associated with the mutation.
+> __Note:__ Due to these filters, the number of cases displayed here will be less than the total number of cases per project.
+
+Hovering the cursor over each bar will display information about the number of cases affected by the disease type, and clicking on each bar will launch the [Gene Summary Page](Exploration.md#gene-summary-page) for the gene associated with the mutation.
Users can toggle the Y-Axis of this bar graph between a percentage or raw number of cases affected.
#### Case Distribution per Project
-A pie chart displays the relative number of cases for each primary site (inner circle), which is further divided by project (outer circle). Hovering the cursor over each portion of the graph will display the primary site or project with the number of associated cases. Filtering projects at the left panel will update the pie chart.
-
+A pie chart displays the relative number of cases for each project. Hovering the cursor over each portion of the graph will display the project with the number of associated cases. Filtering projects at the left panel will update the pie chart.
### Projects Table
@@ -36,11 +36,11 @@ The `Table` tab lists projects by Project ID and provides additional information
[![Projects Table)](images/gdc-projects-table-view.png)](images/gdc-data-portal-project-page.png "Click to see the full image.")
-The table provides links to Project Summary pages in the Project ID column. Columns with file and case counts include links to open the corresponding files or cases in [Repository Page](Repository.md).
+The table provides links to [Project Summary Pages](Projects.md#project-summary-page) in the Project ID column. Columns with file and case counts include links to open the corresponding files or cases in the [Repository Page](Repository.md).
### Projects Graph
-The `Graph` tab contains an interactive view of information in the Table tab. The numerical values in Case Count, File Count, and File Size columns are represented by bars of varying length according to size. These columns are sorted independently in descending order. Mousing over an element of the graph connects it to associated elements in other columns, including Project ID and Primary Site
+The `Graph` tab contains an interactive view of information in the Table tab. The numerical values in Case Count, File Count, and File Size columns are represented by bars of varying length according to size. These columns are sorted independently in descending order. Mousing over an element of the graph connects it to associated elements in other columns, including Project ID and major Primary Sites.
[![Graph Mouseover](images/gdc-table-graph-mouse-over.png)](images/gdc-table-graph-mouse-over.png "Click to see the full image.")
@@ -52,18 +52,18 @@ Like the projects table, the graph will reflect any applied facet filters.
Facets represent properties of the data that can be used for filtering. The facets panel on the left allows users to filter the projects presented in the Table and Graph tabs as well as visualizations.
-[![Panel with Applied Filters](images/gdc-data-portal-project-page-facets.png)](images/gdc-data-portal-project-page-facets.png "Click to see the full image.")
+[![Panel with Applied Filters](images/gdc-data-portal-project-page-facets2.png)](images/gdc-data-portal-project-page-facets2.png "Click to see the full image.")
Users can filter by the following facets:
-* __Project__: Individual project ID
-* __Primary Site__: Anatomical site of the cancer under investigation or review
-* __Program__: Research program that the project is part of
-* __Disease Type__: Type of cancer studied
-* __Data Category__: Type of data available in the project
-* __Experimental Strategy__: Experimental strategies used for molecular characterization of the cancer
+* __Project__: Individual project ID.
+* __Primary Site__: Anatomical site of the cancer under investigation or review.
+* __Program__: Research program that the project is part of.
+* __Disease Type__: Type of cancer studied.
+* __Data Category__: Type of data available in the project.
+* __Experimental Strategy__: Experimental strategies used for molecular characterization of the cancer.
-Filters can be applied by selecting values of interest in the available facets, for example "WXS" and "RNA-Seq" in the "Experimental Strategy" facet and "Brain" in the "Primary Site" facet. When facet filters are applied, the Table and Graph tabs are updated to display matching projects, and the banner above the tabs summarizes the applied filters. The banner allows the user to click on filter elements to remove the associated filters, and includes a link to view the matching cases and files.
+Filters can be applied by selecting values of interest in the available facets, for example "WXS" and "RNA-Seq" in the "Experimental Strategy" facet and "Brain" in the "Primary Site" facet. When facet filters are applied, the Table and Graph tabs are updated to display matching projects, and the banner above the tabs summarizes the applied filters. The banner allows the user to click on filter elements to remove the associated filters and includes a link to view the matching cases and files.
[![Panel with Applied Filters](images/panel-with-applied-filters.png)](images/panel-with-applied-filters.png "Click to see the full image.")
@@ -71,13 +71,13 @@ For information on how to use facet filters, see [Getting Started](Getting_Start
## Project Summary Page
-Each project has a summary page that provides an overview of all available cases, files, and annotations available. Clicking on the numbers in the summary table will display the corresponding data.
+Each project has a Summary Page that provides an overview of all available cases, files, and annotations. Clicking on the numbers in the summary table will display the corresponding data.
[![Project Summary Page](images/gdc-project-entity-page_v3.png)](images/gdc-project-entity-page_v2.png "Click to see the full image.")
-Three download buttons in the top right corner of the screen allow the user to download the entire project dataset, along with the associated project metadata:
+Four buttons in the top right corner of the screen allow the user to explore or download the entire project dataset, along with the associated project metadata:
-* __Explore Project Data__: Opens Exploration page with summary project information.
-* __Download Biospecimen__: Downloads biospecimen metadata associated with all cases in the project in either TSV or JSON format.
-* __Download Clinical__: Downloads clinical metadata about all cases in the project in either TSV or JSON format.
-* __Download Manifest__: Downloads a manifest for all data files available in the project. The manifest can be used with the GDC Data Transfer Tool to download the files.
+* __Explore Project Data__: Opens the Exploration Page with summary project information.
+* __Biospecimen__: Downloads biospecimen metadata associated with all cases in the project in either TSV or JSON format.
+* __Clinical__: Downloads clinical metadata about all cases in the project in either TSV or JSON format.
+* __Manifest__: Downloads a manifest for all data files available in the project. The manifest can be used with the GDC Data Transfer Tool to download the files.
\ No newline at end of file
diff --git a/docs/Data_Portal/Users_Guide/Repository.md b/docs/Data_Portal/Users_Guide/Repository.md
index efc4d3e30..cf839c536 100644
--- a/docs/Data_Portal/Users_Guide/Repository.md
+++ b/docs/Data_Portal/Users_Guide/Repository.md
@@ -1,20 +1,18 @@
# Repository
-## Summary
-
-The Repository Page is the primary method of accessing data in the GDC Data Portal. It provides an overview of all cases and files available in the GDC and offers users a variety of filters for identifying and browsing cases and files of interest. Users can access the Repository Page from the GDC Data Portal front page, from the Data Portal toolbar, or directly at [https://portal.gdc.cancer.gov/repository](https://portal.gdc.cancer.gov/repository).
+The Repository Page is the primary method of accessing data in the GDC Data Portal. It provides an overview of all cases and files available in the GDC and offers users a variety of filters for identifying and browsing cases and files of interest. Users can access the [Repository Page](https://portal.gdc.cancer.gov/repository) from the GDC Data Portal Home Page or from the Data Portal toolbar.
## Filters / Facets
On the left, a panel of data facets allows users to filter cases and files using a variety of criteria. If facet filters are applied, the tabs on the right will display information about matching cases and files. If no filters are applied, the tabs on the right will display information about all available data.
On the right, two tabs contain information about available data:
-* *`Files` tab* provides a list of files, select information about each file, and links to individual file detail pages.
-* *`Cases` tab* provides a list of cases, select information about each case, and links to individual case summary pages
+* `Files` tab provides a list of files, select information about each file, and links to [individual file detail pages](#file-summary-page).
+* `Cases` tab provides a list of cases, select information about each case, and links to [individual case summary pages](Exploration.md#case-summary-page).
The banner above the tabs on the right displays any active facet filters and provides access to advanced search.
-The top of the Repository Page contains a few summary pie charts for Primary Sites, Projects, Disease Type, Gender, and Vital Status. These reflect all available data or, if facet filters are applied, only the data that matches the filters. Clicking on a specific slice in a pie chart, or on a number in a table, applies corresponding facet filters.
+The top of the Repository Page, in the "Files" tab, contains a few summary pie charts for Primary Sites, Projects, Data Category, Data Type, and Data Format. These reflect all available data or, if facet filters are applied, only the data that matches the filters. Clicking on a specific slice in a pie chart, or on a number in a table, applies corresponding facet filters. The scope of these pie charts will change depending on whether the "Files" tab or the "Cases" tab is selected.
[![Data View](images/gdc-data-portal-repository-view_v2.png)](images/gdc-data-portal-repository-view_v2.png "Click to see the full image.")
@@ -22,9 +20,7 @@ The top of the Repository Page contains a few summary pie charts for Primary Sit
Facets represent properties of the data that can be used for filtering. The facets panel on the left allows users to filter the cases and files presented in the tabs on the right.
-The facets panel is divided into two tabs, with the Files tab containing facets pertaining to data files and experimental strategies, while the Cases tab containing facets pertaining to the cases and biospecimen information. Users can apply filters in both tabs simultaneously. The applied filters will be displayed in the banner above the tabs on the right, with the option to open the filter in [Advanced Search](Advanced_Search.md) to further refine the query.
-
-The [Getting Started](Getting_Started.md#facet-filters) section provides instructions on using facet filters. In the following example, a filter from the Cases tab ("primary site") and filters from the Files tab ("data category", "experimental strategy") are both applied:
+The facets panel is divided into two tabs: the `Files` tab contains facets pertaining to data files and experimental strategies, while the `Cases` tab contains facets pertaining to cases and biospecimen information. Users can apply filters in both tabs simultaneously. The applied filters will be displayed in the banner above the tabs on the right, with the option to open the filter in [Advanced Search](Advanced_Search.md) to further refine the query.
[![Facet Filters Applied in Data View](images/data-view-with-facet-filters-applied_v2.png)](images/data-view-with-facet-filters-applied_v2.png "Click to see the full image.")
@@ -44,9 +40,9 @@ The default set of facets is listed below.
*Cases* facets tab:
* __Case__: Specify individual cases using submitter ID (barcode) or UUID.
-* __Case Submitter ID Prefix__: Search for cases using a part (prefix) of the submitter ID (barcode).
+* __Case ID__: Search for cases using a part (prefix) of the submitter ID (barcode).
* __Primary Site__: Anatomical site of the cancer under investigation or review.
-* __Cancer Program__: A cancer research program, typically consisting of multiple focused projects.
+* __Program__: A cancer research program, typically consisting of multiple focused projects.
* __Project__: A cancer research project, typically part of a larger cancer research program.
* __Disease Type__: Type of cancer studied.
* __Gender__: Gender of the patient.
@@ -58,119 +54,126 @@ The default set of facets is listed below.
### Adding Custom Facets
-The Repository Page provides access to additional data facets beyond those listed above. Facets corresponding to additional properties listed in the [GDC Data Dictionary](../../Data_Dictionary/index.md) can be added using the "add a filter" links available at the top of the Cases and Files facet tabs:
+The Repository Page provides access to additional data facets beyond the automatically listed group filters. Facets corresponding to additional properties listed in the [GDC Data Dictionary](../../Data_Dictionary/index.md) can be added using the "Add a Filter" link available at the top of the `Cases` and `Files` facet tabs:
[![Add a Facet](images/gdc-data-portal-data-add-facet.png)](images/gdc-data-portal-data-add-facet.png "Click to see the full image.")
-The links open a search window that allows the user to find an additional facet by name or description. Not all facets have values available for filtering; checking the "Only show fields with values" checkbox will limit the search results to only those that do. Selecting a facet from the list of search results below the search box will add it to the facets panel.
+The link opens a search window that allows the user to find an additional facet by name or description. Not all facets have values available for filtering; checking the "Only show fields with values" checkbox will limit the search results to only those that do. Selecting a facet from the list of search results below the search box will add it to the facets panel.
[![Search for a Facet](images/gdc-data-portal-data-facet-search.png)](images/gdc-data-portal-data-facet-search.png "Click to see the full image.")
-Newly added facets will show up at the top of the facets panel and can be removed individually by clicking on the red cross to the right of the facet name. The default set of facets can be restored by clicking "Reset".
+Newly added facets will show up at the top of the facets panel and can be removed individually by clicking on the "__x__" to the right of the facet name. The default set of facets can be restored by clicking "Reset".
[![Customize Facet](images/gdc-data-portal-data-facet-tumor_stage.png)](images/gdc-data-portal-data-facet-tumor_stage.png "Click to see the full image.")
-## Results
-### Files List
+## Annotations View
-The Files tab on the right provides a list of available files and select information about each file. If facet filters are applied, the list includes only matching files. Otherwise, the list includes all data files available in the GDC Data Portal.
+The Annotations View provides an overview of the available annotations and allows users to browse and filter the annotations based on a number of annotation properties (facets), such as the type of entity the annotation is attached to or the annotation category. This page can be found by clicking on the [Browse Annotations](https://portal.gdc.cancer.gov/annotations) link, located at the top right of the Repository Page.
-[![Files Tab](images/gdc-data-portal-data-files.png)](images/gdc-data-portal-data-files.png "Click to see the full image.")
+[![Annotations View](images/Browse_Annotations.png)](images/Browse_Annotations.png "Click to see the full image.")
-The *File Name* column includes links to [file detail pages](#file-detail-page) where the user can learn more about each file.
+The view presents a list of annotations in tabular format on the right, and a facet panel on the left that allows users to filter the annotations displayed in the table. If facet filters are applied, the tabs on the right will display only the matching annotations. If no filters are applied, the tabs on the right will display information about all available annotations.
-Users can add individual file(s) to the file cart using the cart button next to each file. Alternatively, all files that match the current facet filters can be added to the cart using the menu in the top left corner of the table:
+[![Annotations View](images/gdc-data-portal-annotations.png)](images/gdc-data-portal-annotations.png "Click to see the full image.")
-[![Files Tab](images/gdc-data-portal-data-files-add-cart.png)](images/gdc-data-portal-data-files-add-cart.png "Click to see the full image.")
+Clicking on an annotation ID in the annotations list will take the user to the Annotation Summary Page. The Annotation Summary Page provides more details about a specific annotation.
-### Cases List
+[![Annotation Entity Page](images/annotations-entity-page.png)](images/annotations-entity-page.png "Click to see the full image.")
-The Cases tab on the right provides a list of available cases and select information about each case. If facet filters are applied, the list includes only matching cases. Otherwise, the list includes all cases available in the GDC Data Portal.
+## Results
-[![Cases Tab](images/gdc-data-portal-data-cases_v3.png)](images/gdc-data-portal-data-cases_v3.png "Click to see the full image.")
+### Navigation
+
+After utilizing the Repository Page to narrow down a specific set of cases, users can choose to continue to explore the mutations and genes affected by these cases by clicking the `View Cases in Exploration` button as shown in the image below.
+
+[![Exploration File Navigation](images/gdc-view-in-exploration_v3.png)](images/gdc-view-in-exploration_v3.png "Click to see the full image.")
-The list includes links to [case summary pages](#case-summary-page) in the *Case UUID* column, the Submitter ID (i.e. TCGA Barcode), and counts of the available file types for each case. Clicking on a count will apply facet filters to display the corresponding files.
+Clicking this button will navigate users to the [Exploration Page](Exploration.md), filtered by the cases within the cohort.
-The list also includes a shopping cart button, allowing the user to add all files associated with a case to the file cart for downloading at a later time:
+### Files List
-[![Cases Tab, Add to Cart](images/gdc-data-portal-data-case-add-cart.png)](images/gdc-data-portal-data-case-add-cart.png "Click to see the full image.")
+The `Files` tab on the right provides a list of available files and select information about each file. If facet filters are applied, the list includes only matching files. Otherwise, the list includes all data files available in the GDC Data Portal.
+[![Files Tab](images/gdc-data-portal-data-files.png)](images/gdc-data-portal-data-files.png "Click to see the full image.")
-## Navigation
+The "*File Name*" column includes links to [File Summary Pages](#file-summary-page) where the user can learn more about each file.
-After utilizing the Repository Page to narrow down a specific set of cases, users can continue to explore the mutations and genes affected by these cases by clicking the `View Files in Repository` button as shown in the image below.
+Users can add individual file(s) to the [cart](Cart.md) using the cart button next to each file. Alternatively, all files that match the current facet filters can be added to the cart using the menu in the top left corner of the table:
-[![Exploration File Navigation](images/gdc-view-in-exploration_v3.png)](images/gdc-view-in-exploration_v3.png "Click to see the full image.")
+[![Files Tab](images/gdc-data-portal-data-files-add-cart.png)](images/gdc-data-portal-data-files-add-cart.png "Click to see the full image.")
-Clicking this button will navigate the users to the Exploration Page, filtered by the cases within the cohort.
+## File Summary Page
-## Case Summary Page
+The File Summary page provides information about a data file, including file properties like size, MD5 checksum, and data format; information on the type of data included; links to the associated cases and biospecimen; and information about how the data file was generated or processed.
+
+The page also includes buttons to download the file, add it to the file [cart](Cart.md), or (for BAM files) utilize the BAM slicing function.
+
+[![Files Detail Page](images/gdc-data-portal-files-entity-page.png)](images/gdc-data-portal-files-entity-page.png "Click to see the full image.")
+
+In the lower section of the screen, the following tables provide more details about the file and its characteristics:
-The Case Summary page displays case details including the project and disease information, data files that are available for that case, and the experimental strategies employed. A button in the top-right corner of the page allows the user to add all files associated with the case to the file cart.
+* __Associated Cases / Biospecimen__: List of cases or biospecimen the file is directly attached to.
+* __Analysis and Reference Genome__: Information on the workflow and reference genome used for file generation.
+* __Read Groups__: Information on the read groups associated with the file.
+* __Metadata Files__: Experiment metadata, run metadata and analysis metadata associated with the file.
+* __Downstream Analysis Files__: List of downstream analysis files generated by the file.
+* __File Versions__: List of all versions of the file.
-[![Case Page](images/gdc-case-entity-page.png)](images/gdc-case-entity-page.png "Click to see the full image.")
-### Clinical and Biospecimen Information
+[![Files Entity Page](images/gdc-data-portal-files-entity-page-part2_v2.png)](images/gdc-data-portal-files-entity-page-part2_v2.png "Click to see the full image.")
-The page also provides clinical and biospecimen information about that case. Links to export clinical and biospecimen information in JSON format are provided.
+>**Note**: *The Legacy Archive* will not display the "Workflow, Reference Genome and Read Groups" sections (these sections are applicable to the GDC harmonization pipeline only). However, it may provide information on archives and metadata files like MAGE-TABs and SRA XMLs. For more information, please refer to the section [Legacy Archive](Legacy_Archive.md).
-[![Case Page, Clinical and Biospecimen](images/gdc-case-clinical-biospecimen_v3.png)](images/gdc-case-clinical-biospecimen_v3.png "Click to see the full image.")
+### BAM Slicing
-For clinical records that support multiple records of the same type (Diagnoses, Family Histories, or Exposures), a UUID of the record is provided on the left hand side of the corresponding tab, allowing the user to select the entry of interest.
+BAM file Summary Pages have a "BAM Slicing" button. This function allows the user to specify a region of a BAM file for download. Clicking on it will open the BAM Slicing window:
-### Biospecimen Search
+[![BAM Slicing Window](images/gdc-data-portal-bam-slicing.png)](images/gdc-data-portal-bam-slicing.png "Click to see the full image.")
-A search filter just below the biospecimen section can be used to find and filter biospecimen data. The wildcard search will highlight entities in the tree that match the characters typed. This will search both the case submitter ID, as well as the additional metadata for each entity. For example, searching 'Primary Tumor' will highlight samples that match that type.
+During preparation of the slice, the icon on the BAM Slicing button will be spinning, and the file will be offered for download to the user as soon as it is ready.
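
The same kind of region selection can also be scripted. The sketch below only builds a request URL and does not contact any server. The endpoint path and the repeated `region` query parameter are assumptions taken from the GDC API's BAM slicing documentation rather than from this guide, and `slicing_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint, based on the GDC API documentation (not stated in this guide).
GDC_SLICING_ENDPOINT = "https://api.gdc.cancer.gov/slicing/view"


def slicing_url(file_uuid, regions):
    """Build a slicing URL for one BAM file and one or more regions.

    Each region is a string such as "chr1" or "chr2:10000-20000".
    """
    # One "region" parameter per requested region; keep ":" readable.
    query = urlencode([("region", region) for region in regions], safe=":")
    return f"{GDC_SLICING_ENDPOINT}/{file_uuid}?{query}"
```

Controlled-access files would additionally need an authentication token sent with the request, which this sketch leaves out.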
-[![Biospecimen Search](images/gdc-case-biospecimen-search_v2.png)](images/gdc-case-biospecimen-search_v2.png "Click to see the full image.")
+### Cases List
-### Most Frequent Somatic Mutations
+The `Cases` tab on the right provides a list of available cases and select information about each case. If facet filters are applied, the list includes only matching cases. Otherwise, the list includes all cases available in the GDC Data Portal.
-The case entity page also lists the mutations found in that particular case.
+[![Cases Tab](images/gdc-data-portal-data-cases_v3.png)](images/gdc-data-portal-data-cases_v3.png "Click to see the full image.")
-[![Case Page](images/gdc-case-entity-mfm.png)](images/gdc-case-entity-mfm.png "Click to see the full image.")
+The list starts on the left with a shopping cart icon, which allows the user to add all files associated with a case to the [file cart](Cart.md) for downloading at a later time. The following columns include links to [Case Summary Pages](Exploration.md#case-summary-page) in the *Case UUID* column, the Submitter ID (e.g. TCGA Barcode), and counts of the available file types for each case. Clicking on a count will apply facet filters to display the corresponding files. In the last column, a slide image icon and a number indicate whether slide images are available and how many.
-The table lists the following information for each mutation
+## Image Viewer
-* __DNA Change:__ The chromosome and starting coordinates of the mutation are displayed along with the nucleotide differences between the reference and tumor allele
-* __Type:__ A general classification of the mutation
-* __Consequences:__ The effects the mutation has on the gene coding for a protein (i.e. synonymous, missense, non-coding transcript)
-* __# Affected Cases in Project:__ The number of affected cases, expressed as number across all mutations within the Project
-* __# Affected Cases Across GDC:__ The number of affected cases, expressed as number across all projects. Choosing the arrow next to the percentage will expand the selection with a breakdown of each affected project
-* __Impact (VEP):__ A subjective classification of the severity of the variant consequence. This information comes from the [Ensembl VEP](http://www.ensembl.org/info/genome/variation/predicted_data.html). The categories are:
- - __HIGH (H)__: The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay
- - __MODERATE (M)__: A non-disruptive variant that might change protein effectiveness
- - __LOW (L)__: Assumed to be mostly harmless or unlikely to change protein behavior
- - __MODIFIER (MO)__: Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact
+The Image Viewer allows users to visualize tissue and diagnostic slide images.
-Clicking on the `Open in Exploration` button at the top right of this section will navigate the user to the Exploration page, filtered on this case.
+[![Image Viewer](images/Image_viewer_browser.png)](images/Image_viewer_browser.png "Click to see the full image.")
-## File Summary Page
+### How to Access the Image Viewer
-The File Summary page provides information a data file, including file properties like size, md5 checksum, and data format; information on the type of data included; links to the associated case and biospecimen; and information about how the data file was generated or processed.
+* __Repository Page__: Click the "View images" button in the main search on the Repository Page. The Image Viewer will display the tissue slide images of all the cases resulting from the query.
-The page also includes buttons to download the file, add it to the file cart, or (for BAM files) utilize the BAM slicing function.
+[![Image Viewer](images/Image_Viewer_from_Repository.png)](images/Image_Viewer_from_Repository.png "Click to see the full image.")
-[![Files Detail Page](images/gdc-data-portal-files-entity-page.png)](images/gdc-data-portal-files-entity-page.png "Click to see the full image.")
+* __Case Table in Repository Page__: Click on the Image Viewer icon in the Case table. The Image Viewer will display all the tissue slide images attached to the case.
-In the lower section of the screen, the following tables provide more details about the file and its characteristics:
+[![Cases Tab](images/gdc-data-portal-data-cases_v3.png)](images/gdc-data-portal-data-cases_v3.png "Click to see the full image.")
-* __Associated Cases / Biospecimen__: List of Cases or biospecimen the file is directly attached to.
-* __Analysis and Reference Genome__: Information on the workflow and reference genome used for file generation.
-* __Read Groups__: Information on the read groups associated with the file.
-* __Metadata Files__: Experiment metadata, run metadata and analysis metadata associated with the file.
-* __Downstream Analysis Files__: List of downstream analysis files generated by the file.
-* __File Versions__: List of all versions of the file.
+* __Case Summary Page:__ Selecting a Case ID in the Repository Cases table will direct the user to the [Case Summary Page](Exploration.md#case-summary-page). For cases with images, the Image Viewer icon will appear in the Case Summary section or in the Biospecimen - Slides details section. Clicking on the Image Viewer icon will display the Image Viewer for the slide images attached to the case.
+ [![Image Viewer](images/Image_viewer_case_summary.png)](images/Image_viewer_case_summary.png "Click to see the full image.")
+ [![Image Viewer](images/Image_viewer_case_slide_section.png)](images/Image_viewer_case_slide_section.png "Click to see the full image.")
-[![Files Entity Page](images/gdc-data-portal-files-entity-page-part2_v2.png)](images/gdc-data-portal-files-entity-page-part2_v2.png "Click to see the full image.")
+* __The Image File Page__: You can visualize the slide image directly in the File Summary Page by selecting an image file in the Repository's files table.
-**Note**: *The Legacy Archive* will not display "Workflow, Reference Genome and Read Groups" sections (these sections are applicable to the GDC harmonization pipeline only). However it may provide information on Archives and metadata files like MAGE-TABs and SRA XMLs. For more information, please refer to the section [Legacy Archive](Legacy_Archive.md).
+[![Image Viewer](images/Repository_select_image.png)](images/Repository_select_image.png "Click to see the full image.")
-### BAM Slicing
+[![Image Viewer](images/Image_viewer_File_entity.png)](images/Image_viewer_File_entity.png "Click to see the full image.")
-BAM file detail pages have a "BAM Slicing" button. This function allows the user to specify a region of a BAM file for download. Clicking on it will open the BAM slicing window:
+### Image Viewer Features
+In the image viewer, a user can:
-[![BAM Slicing Window](images/gdc-data-portal-bam-slicing.png)](images/gdc-data-portal-bam-slicing.png "Click to see the full image.")
+* Zoom in and zoom out by clicking on the + and - icons.
+* Reset to default display by clicking on the Home icon.
+* Display the image in full screen mode by clicking on the Expand icon.
+* View the slide details by clicking on the "Details" button.
+* Select the area of interest with the thumbnail at the top-right corner.
-During preparation of the slice, the icon on the BAM Slicing button will be spinning, and the file will be offered for download to the user as soon as ready.
+[![Image Viewer](images/Image_viewer_features.png)](images/Image_viewer_features.png "Click to see the full image.")
\ No newline at end of file
diff --git a/docs/Data_Portal/Users_Guide/images/Browse_Annotations.png b/docs/Data_Portal/Users_Guide/images/Browse_Annotations.png
new file mode 100644
index 000000000..46c62f331
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/Browse_Annotations.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/Exploration-Case-Example_v3.png b/docs/Data_Portal/Users_Guide/images/Exploration-Case-Example_v3.png
index e28cf845c..d94b465b3 100644
Binary files a/docs/Data_Portal/Users_Guide/images/Exploration-Case-Example_v3.png and b/docs/Data_Portal/Users_Guide/images/Exploration-Case-Example_v3.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/Exploration-Gene-Example2.png b/docs/Data_Portal/Users_Guide/images/Exploration-Gene-Example2.png
new file mode 100644
index 000000000..1b8894f75
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/Exploration-Gene-Example2.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/Exploration-Gene-Example_v2.png b/docs/Data_Portal/Users_Guide/images/Exploration-Gene-Example_v2.png
new file mode 100644
index 000000000..b0171dead
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/Exploration-Gene-Example_v2.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/Exploration-Mutation-Example.png b/docs/Data_Portal/Users_Guide/images/Exploration-Mutation-Example.png
index 51d3da719..ed53656b4 100644
Binary files a/docs/Data_Portal/Users_Guide/images/Exploration-Mutation-Example.png and b/docs/Data_Portal/Users_Guide/images/Exploration-Mutation-Example.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/Exploration-Oncogrid-Color-Picker.png b/docs/Data_Portal/Users_Guide/images/Exploration-Oncogrid-Color-Picker.png
new file mode 100644
index 000000000..052c85c57
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/Exploration-Oncogrid-Color-Picker.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/Exploration-Oncogrid-Example_v2.png b/docs/Data_Portal/Users_Guide/images/Exploration-Oncogrid-Example_v2.png
new file mode 100644
index 000000000..0689672e4
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/Exploration-Oncogrid-Example_v2.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/Exploration-View-Files_v3.png b/docs/Data_Portal/Users_Guide/images/Exploration-View-Files_v3.png
index c9dfb63c3..ac841e590 100644
Binary files a/docs/Data_Portal/Users_Guide/images/Exploration-View-Files_v3.png and b/docs/Data_Portal/Users_Guide/images/Exploration-View-Files_v3.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/GDC-ExplorationSet-Cohort_v2.png b/docs/Data_Portal/Users_Guide/images/GDC-ExplorationSet-Cohort_v2.png
index a86e40e53..1b7dc32c5 100644
Binary files a/docs/Data_Portal/Users_Guide/images/GDC-ExplorationSet-Cohort_v2.png and b/docs/Data_Portal/Users_Guide/images/GDC-ExplorationSet-Cohort_v2.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/GDC-Gene-CancerDist.png b/docs/Data_Portal/Users_Guide/images/GDC-Gene-CancerDist.png
index 92a42d89a..a27de3581 100644
Binary files a/docs/Data_Portal/Users_Guide/images/GDC-Gene-CancerDist.png and b/docs/Data_Portal/Users_Guide/images/GDC-Gene-CancerDist.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/GDC-Gene-CancerDist_v2.png b/docs/Data_Portal/Users_Guide/images/GDC-Gene-CancerDist_v2.png
new file mode 100644
index 000000000..0bf68368e
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/GDC-Gene-CancerDist_v2.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/GDC-Gene-MFM.png b/docs/Data_Portal/Users_Guide/images/GDC-Gene-MFM.png
index 51b224381..849785711 100644
Binary files a/docs/Data_Portal/Users_Guide/images/GDC-Gene-MFM.png and b/docs/Data_Portal/Users_Guide/images/GDC-Gene-MFM.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/GDC-Gene-ProteinGraph.png b/docs/Data_Portal/Users_Guide/images/GDC-Gene-ProteinGraph.png
index f58906822..c8801011c 100644
Binary files a/docs/Data_Portal/Users_Guide/images/GDC-Gene-ProteinGraph.png and b/docs/Data_Portal/Users_Guide/images/GDC-Gene-ProteinGraph.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/GDC-Gene-Summary.png b/docs/Data_Portal/Users_Guide/images/GDC-Gene-Summary.png
index 2b330ea8a..9da5f9994 100644
Binary files a/docs/Data_Portal/Users_Guide/images/GDC-Gene-Summary.png and b/docs/Data_Portal/Users_Guide/images/GDC-Gene-Summary.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/GDC-Mutation-CancerDist.png b/docs/Data_Portal/Users_Guide/images/GDC-Mutation-CancerDist.png
index b737f2291..8d4b97a9a 100644
Binary files a/docs/Data_Portal/Users_Guide/images/GDC-Mutation-CancerDist.png and b/docs/Data_Portal/Users_Guide/images/GDC-Mutation-CancerDist.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/GDC-Mutation-Consequences.png b/docs/Data_Portal/Users_Guide/images/GDC-Mutation-Consequences.png
index 3fab97e96..6fad980d3 100644
Binary files a/docs/Data_Portal/Users_Guide/images/GDC-Mutation-Consequences.png and b/docs/Data_Portal/Users_Guide/images/GDC-Mutation-Consequences.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/GDC-Mutation-ProteinGraph.png b/docs/Data_Portal/Users_Guide/images/GDC-Mutation-ProteinGraph.png
index a49fd4f89..be6df6554 100644
Binary files a/docs/Data_Portal/Users_Guide/images/GDC-Mutation-ProteinGraph.png and b/docs/Data_Portal/Users_Guide/images/GDC-Mutation-ProteinGraph.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/GDC-Mutation-Summary.png b/docs/Data_Portal/Users_Guide/images/GDC-Mutation-Summary.png
index 617a54f6e..f55f5ca89 100644
Binary files a/docs/Data_Portal/Users_Guide/images/GDC-Mutation-Summary.png and b/docs/Data_Portal/Users_Guide/images/GDC-Mutation-Summary.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/Repository_select_image.png b/docs/Data_Portal/Users_Guide/images/Repository_select_image.png
new file mode 100644
index 000000000..d390db17b
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/Repository_select_image.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/gdc-case-entity-mfm.png b/docs/Data_Portal/Users_Guide/images/gdc-case-entity-mfm.png
index 2a6569a71..3b5363eea 100644
Binary files a/docs/Data_Portal/Users_Guide/images/gdc-case-entity-mfm.png and b/docs/Data_Portal/Users_Guide/images/gdc-case-entity-mfm.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/gdc-data-portal-download-cart_v2.png b/docs/Data_Portal/Users_Guide/images/gdc-data-portal-download-cart_v2.png
new file mode 100644
index 000000000..81f2f0f1b
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/gdc-data-portal-download-cart_v2.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/gdc-data-portal-project-page-facets2.png b/docs/Data_Portal/Users_Guide/images/gdc-data-portal-project-page-facets2.png
new file mode 100644
index 000000000..a2f961f0b
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/gdc-data-portal-project-page-facets2.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/gdc-data-portal-project-page.png b/docs/Data_Portal/Users_Guide/images/gdc-data-portal-project-page.png
index 89b3b8ad4..d0d2bbab9 100644
Binary files a/docs/Data_Portal/Users_Guide/images/gdc-data-portal-project-page.png and b/docs/Data_Portal/Users_Guide/images/gdc-data-portal-project-page.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/gdc-input-set_v2.png b/docs/Data_Portal/Users_Guide/images/gdc-input-set_v2.png
index 8eed614b4..b9e950ede 100644
Binary files a/docs/Data_Portal/Users_Guide/images/gdc-input-set_v2.png and b/docs/Data_Portal/Users_Guide/images/gdc-input-set_v2.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/gdc-kaplan-meier-estimator.svg b/docs/Data_Portal/Users_Guide/images/gdc-kaplan-meier-estimator.svg
new file mode 100644
index 000000000..8bf37b3a7
--- /dev/null
+++ b/docs/Data_Portal/Users_Guide/images/gdc-kaplan-meier-estimator.svg
@@ -0,0 +1,53 @@
+
+
+
diff --git a/docs/Data_Portal/Users_Guide/images/gdc-kaplan-meier-estimator2.png b/docs/Data_Portal/Users_Guide/images/gdc-kaplan-meier-estimator2.png
new file mode 100644
index 000000000..1f756fe09
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/gdc-kaplan-meier-estimator2.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/gdc-table-graph-mouse-over.png b/docs/Data_Portal/Users_Guide/images/gdc-table-graph-mouse-over.png
index 06d02ffed..dd40e17ef 100644
Binary files a/docs/Data_Portal/Users_Guide/images/gdc-table-graph-mouse-over.png and b/docs/Data_Portal/Users_Guide/images/gdc-table-graph-mouse-over.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/gdc_case_biospecimen_search_v3.png b/docs/Data_Portal/Users_Guide/images/gdc_case_biospecimen_search_v3.png
new file mode 100644
index 000000000..82b5414bb
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/gdc_case_biospecimen_search_v3.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/gdc_project_visualizations2.png b/docs/Data_Portal/Users_Guide/images/gdc_project_visualizations2.png
new file mode 100644
index 000000000..e018a9aaa
Binary files /dev/null and b/docs/Data_Portal/Users_Guide/images/gdc_project_visualizations2.png differ
diff --git a/docs/Data_Portal/Users_Guide/images/mutation-set-filter.png b/docs/Data_Portal/Users_Guide/images/mutation-set-filter.png
index 2fc7c861c..6e8d1c95a 100644
Binary files a/docs/Data_Portal/Users_Guide/images/mutation-set-filter.png and b/docs/Data_Portal/Users_Guide/images/mutation-set-filter.png differ
diff --git a/docs/Data_Submission_Portal/Release_Notes/Data_Submission_Portal_Release_Notes.md b/docs/Data_Submission_Portal/Release_Notes/Data_Submission_Portal_Release_Notes.md
index 1dbdb6073..f397e1470 100644
--- a/docs/Data_Submission_Portal/Release_Notes/Data_Submission_Portal_Release_Notes.md
+++ b/docs/Data_Submission_Portal/Release_Notes/Data_Submission_Portal_Release_Notes.md
@@ -2,6 +2,8 @@
| Version | Date |
|---|---|
+| [v2.2.0](Data_Submission_Portal_Release_Notes.md#release-220) | February 20, 2019 |
+| [v2.1.0](Data_Submission_Portal_Release_Notes.md#release-210) | November 7, 2018 |
| [v2.0.0](Data_Submission_Portal_Release_Notes.md#release-200) | August 23, 2018 |
| [v1.9.0](Data_Submission_Portal_Release_Notes.md#release-190) | May 21, 2018 |
| [v1.8.0](Data_Submission_Portal_Release_Notes.md#release-180) | February 15, 2018 |
@@ -15,6 +17,44 @@
| [v0.3.21](Data_Submission_Portal_Release_Notes.md#release-0321) | January 27, 2016 |
| [v0.2.18.3](Data_Submission_Portal_Release_Notes.md#release-02183) | November 30, 2015 |
+## Release 2.2.0
+
+* __GDC Product__: GDC Data Submission Portal
+* __Release Date__: February 20, 2019
+
+### New Features and Changes
+
+* Renamed the "Request Submission" button to "Request Harmonization" to make the purpose of this action more clear.
+
+### Bugs Fixed Since Last Release
+
+* Fixed the right scroll bar in the records list on the Browse page so that it works in Firefox.
+* Fixed a dead link to the Submission Portal User Guide on the Dashboard.
+
+### Known Issues and Workarounds
+
+* When creating entities in the Submission Portal, occasionally an extra transaction will appear with an error status. This does not seem to impact the actual transaction, which is recorded as occurring successfully.
+
+
+## Release 2.1.0
+
+* __GDC Product__: GDC Data Submission Portal
+* __Release Date__: November 7, 2018
+
+### New Features and Changes
+
+* Updated the project columns to include a Release column in addition to the Batch Submit column.
+
+### Bugs Fixed Since Last Release
+
+* Fixed quick search so that projects with a dash in the name will no longer break the search.
+* PO reports will now return the latest data for each project that has completed running.
+
+### Known Issues and Workarounds
+
+* When creating entities in the Submission Portal, occasionally an extra transaction will appear with an error status. This does not seem to impact the actual transaction, which is recorded as occurring successfully.
+
+
## Release 2.0.0
* __GDC Product__: GDC Data Submission Portal
diff --git a/docs/Data_Submission_Portal/Users_Guide/Best_Practices.md b/docs/Data_Submission_Portal/Users_Guide/Best_Practices.md
index 7111f6bdc..23e7eee0f 100644
--- a/docs/Data_Submission_Portal/Users_Guide/Best_Practices.md
+++ b/docs/Data_Submission_Portal/Users_Guide/Best_Practices.md
@@ -1,6 +1,6 @@
# Submission Best Practices
-Because of the data types and relationships included in the GDC, data submission can become a complex procedure. The purpose of this section is to present guidelines that will aid in the incorporation and harmonization of submitters' data. Please contact the GDC Help Desk at __support@nci-gdc.datacommons.io__ if you have any questions or concerns regarding a submission project.
+Because of the data types and relationships included in the GDC, data submission can become a complex procedure. The purpose of this section is to present guidelines that will aid in the incorporation and harmonization of submitters' data. Please contact the GDC Help Desk at ____ if you have any questions or concerns regarding a submission project.
## Date Obfuscation
@@ -8,66 +8,54 @@ The GDC is committed to providing accurate and useful information as well as pro
### General Guidelines
-Actual calendar dates are not reported in GDC clinical fields but the lengths of time between events are preserved. Time points are reported based on the number of days since the patient's initial diagnosis. Events that occurred after the initial diagnosis are reported as positive and events that occurred before are reported as negative. Dates are not automatically obfuscated by the GDC validation system and submitters are required to make these changes in their clinical data.
+Actual calendar dates are not reported in GDC clinical fields, but the lengths of time between events are preserved. Time points are reported based on the number of days since the patient's initial diagnosis. Events that occurred after the initial diagnosis are reported as positive and events that occurred before are reported as negative. Dates are not automatically obfuscated by the GDC validation system, and submitters are required to make these changes in their clinical data. The affected fields are `days_to_birth`, `days_to_death`, `days_to_last_follow_up`, `days_to_last_known_disease_status`, `days_to_recurrence`, and `days_to_treatment`.
-| Affected Fields |
-| --- |
-| `days_to_birth` |
-| `days_to_death` |
-| `days_to_last_follow_up` |
-| `days_to_last_known_disease_status` |
-| `days_to_recurrence` |
-| `days_to_treatment` |
+>__Note:__ The day-based fields take leap years into account.
-### Patients Older than 90 Years
+### Patients Older than 90 Years and Clinical Events
Because of the low population number within the demographic of patients over 90 years old, it becomes more likely that patients can potentially be identified by a combination of their advanced age and publicly available clinical data. Because of this, patients over 90 years old are reported as exactly 90 years or 32,872 days old.
-__Note:__ The day-based fields take leap years into account.
-
-### Clinical Events After a Patient Turns 90 Years Old
-
-Clinical events that occur over 32,872 days after an event also have the potential to reveal the age and identity of an individual over the age of 90. Following this, all timelines are capped at 32,872 days. When timelines are capped, the priority should be to shorten the post-diagnosis values to preserve the accuracy of the age of the patient (except for patients who were diagnosed at over 90 years old). Values such as `days_to_death` and `days_to_recurrence` should be compressed before `days_to_birth` is compressed.
+Following this, clinical events that occur more than 32,872 days after an event are also capped at 32,872 days. When timelines are capped, the priority should be to shorten the post-diagnosis values to preserve the accuracy of the age of the patient (except for patients who were diagnosed at over 90 years old). Values such as `days_to_death` and `days_to_recurrence` should be compressed before `days_to_birth` is compressed.
### Examples Timelines
__Example 1:__ An 88 year old patient is diagnosed with cancer and dies 13 years later. The `days_to_birth` value is less than 32,872 days, so it can be accurately reported. However, between the initial diagnosis and death, the patient turned 90 years old. Since 32,872 is the maximum, `days_to_death` would be calculated as 32872 - 32142 = 730.
-__Dates__
+>__Dates__
-* _Date of Birth:_ 01-01-1900
-* _Date of Initial Diagnosis:_ 01-01-1988
-* _Date of Death:_ 01-01-2001
+>* _Date of Birth:_ 01-01-1900
+>* _Date of Initial Diagnosis:_ 01-01-1988
+>* _Date of Death:_ 01-01-2001
-__Actual-Values__
+>__Actual-Values__
-* _days_to_birth:_ -32142
-* _days_to_death:_ 4748
+>* _days_to_birth:_ -32142
+>* _days_to_death:_ 4748
-__Obfuscated-Values__
+>__Obfuscated-Values__
-* _days_to_birth:_ -32142
-* _days_to_death:_ 730
+>* _days_to_birth:_ -32142
+>* _days_to_death:_ 730
__Example 2:__ A 98 year old patient is diagnosed with cancer and dies three years later. Because `days_to_X` values are counted from initial diagnosis, days will be at their maximum value of 32,872 upon initial diagnosis. This will compress the later dates and reduce `days_to_birth` to -32,872 and `days_to_death` to zero.
-__Dates__
-
-* _Date of Birth:_ 01-01-1900
-* _Date of Initial Diagnosis:_ 01-01-1998
-* _Date of Death:_ 01-01-2001
+>__Dates__
-__Actual-Values__
+>* _Date of Birth:_ 01-01-1900
+>* _Date of Initial Diagnosis:_ 01-01-1998
+>* _Date of Death:_ 01-01-2001
-* _days_to_birth:_ -35794
-* _days_to_death:_ 1095
+>__Actual-Values__
-__Obfuscated-Values__
+>* _days_to_birth:_ -35794
+>* _days_to_death:_ 1095
-* _days_to_birth:_ -32872
-* _days_to_death:_ 0
+>__Obfuscated-Values__
+>* _days_to_birth:_ -32872
+>* _days_to_death:_ 0
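
The capping rules and the two example timelines above can be sketched in Python as follows. This is an illustrative sketch, not GDC code: the function name `obfuscate_timeline` and its arguments are hypothetical, and it only handles the capping of `days_to_birth` and of post-diagnosis values described in this section.

```python
MAX_DAYS = 32872  # 90 years expressed in days, as stated in the guideline


def obfuscate_timeline(days_to_birth, days_to_events):
    """Cap a timeline so that no value implies an age above 90 years.

    days_to_birth is negative because birth precedes the initial diagnosis.
    days_to_events maps field names such as "days_to_death" to the number
    of days since the initial diagnosis.
    """
    # Age at diagnosis in days, capped at the maximum.
    age_at_diagnosis = min(-days_to_birth, MAX_DAYS)
    # Days left after diagnosis before the patient reaches the cap.
    remaining = MAX_DAYS - age_at_diagnosis
    # Post-diagnosis values are shortened first; days_to_birth only changes
    # when the patient was already older than the cap at diagnosis.
    capped = {name: min(value, remaining) for name, value in days_to_events.items()}
    return -age_at_diagnosis, capped


# Example 1: diagnosed at 88, died 13 years later.
print(obfuscate_timeline(-32142, {"days_to_death": 4748}))
# Example 2: diagnosed at 98, died three years later.
print(obfuscate_timeline(-35794, {"days_to_death": 1095}))
```

With the inputs from Example 1 the sketch returns `-32142` and `730`, and with those from Example 2 it returns `-32872` and `0`, matching the obfuscated values listed above.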
## Submitting Complex Data Model Relationships
@@ -97,23 +85,25 @@ submitted_aligned_reads Alignment.bam Raw Sequencing Data BAM Aligned Reads W
}
```
-## Read groups
+### Read groups
-### Submitting Read Group Names
+#### Submitting Read Group Names
The `read_group` entity requires a `read_group_name` field for submission. If the `read_group` entity is associated with a BAM file, the submitter should use the `@RG` ID present in the BAM header as the `read_group_name`. This is important for the harmonization process and will reduce the possibility of errors.
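
As a rough illustration, the `@RG` IDs can be pulled out of a BAM header that has been dumped to text (for example with `samtools view -H`). The helper name `read_group_ids` is hypothetical, and the sketch assumes the header is already available as a string.

```python
def read_group_ids(sam_header_text):
    """Return the ID of every @RG line in a SAM/BAM header given as text."""
    ids = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        for field in line.split("\t")[1:]:
            tag, _, value = field.partition(":")
            if tag == "ID":
                ids.append(value)
    return ids
```

Each returned ID is a candidate `read_group_name` for the corresponding `read_group` entity.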
-### Minimal Read Group Information
+#### Multiple FASTQs from One Read Group
-In addition to the required properties on `read_group` we also recommend submitting `flow_cell_barcode`, `lane_number` and `multiplex_barcode`. This information can be used by our bioinformatics team and data downloaders to construct a `Platform Unit` (`PU`), which is a universally unique identifier that can be used to model various sequencing technical artifacts. More information can be found in the SAM specification (https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf).
+To align reads according to their direction and pair, the GDC requires that unaligned forward and reverse reads are submitted as "submitted_unaligned_reads." When more than one FASTQ exists for a read group direction, the GDC requires that the FASTQ files are concatenated for each direction. In other words, each paired-end read group should be associated with exactly two FASTQ files (submitted_unaligned_reads).
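
As a minimal sketch of the concatenation step, the helper below appends several FASTQ files for one direction into a single file. The function name and the file names in the usage comment are hypothetical. It copies bytes as-is, so it assumes the inputs are all plain text or all gzip-compressed.

```python
import shutil
from pathlib import Path


def concatenate_fastqs(parts, destination):
    """Append each input file, in the order given, to one output file."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(destination)


# Hypothetical usage: one call per direction, with the parts listed in the
# same order for both directions so that read pairs stay aligned.
# concatenate_fastqs(["rg1_L001_R1.fastq", "rg1_L002_R1.fastq"], "rg1_R1.fastq")
# concatenate_fastqs(["rg1_L001_R2.fastq", "rg1_L002_R2.fastq"], "rg1_R2.fastq")
```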
-For projects with library strategies of targeted sequencing or WXS we also require information on the target capture protocol included on `target_capture_kit`
+#### Minimal and Recommended Read Group Information
-If this information is not provided it may cause a delay in the processing of submitted data.
+In addition to the required properties on `read_group` we also recommend submitting `flow_cell_barcode`, `lane_number` and `multiplex_barcode`. This information can be used by our bioinformatics team and data downloaders to construct a `Platform Unit` (`PU`), which is a universally unique identifier that can be used to model various sequencing technical artifacts. More information can be found in the [SAM specification PDF](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf).
+
+For projects with library strategies of targeted sequencing or WXS we also require information on the target capture protocol included on `target_capture_kit`.
-### Recommended Read Group Information
+If this information is not provided it may cause a delay in the processing of submitted data.
-Additional read group information will benefit data users. Such information can be used by bioinformatics pipelines and will aid understanding and mitigation of batch effects. If available you should also provide as many of the remaining read group properties as possible.
+Additional read group information will benefit data users. Such information can be used by bioinformatics pipelines and will aid understanding and mitigation of batch effects. If available, you should also provide as many of the remaining read group properties as possible.
## Submission File Quality Control
@@ -129,19 +119,19 @@ Target region information is important for DNA-Seq variant calling and filtering
3. How do submitters provide this information?
There are 3 steps
- * Step 1. The submitter should contact GDC User Service about any new Target Capture Kits that do not already exist in the GDC Dictionary. The GDC Bioinformatics and User Services teams will work together with the submitter to create a meaningful name for the kit, and import this name and Target Region Bed File into the GDC.
+ * Step 1. The submitter should contact GDC User Service about any new Target Capture Kits that do not already exist in the GDC Dictionary. The GDC Bioinformatics and User Services teams will work together with the submitter to create a meaningful name for the kit and import this name and Target Region Bed File into the GDC.
* Step 2. The submitter can then select one and only one GDC Target Capture Kit for each read group during molecular data submission.
- * Step 3. The submitter should also selection the appropriate `library_strategy` and `library_selection` on the read_group entity.
+ * Step 3. The submitter should also select the appropriate `library_strategy` and `library_selection` on the read_group entity.
4. What is a Target Region Bed File?
-A Target Region Bed File is tab-delimited file describing the kit target region in bed format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1). The first 3 columns of such files are chrom, chromStart, and chromEnd.
+A Target Region Bed File is a tab-delimited file describing the kit target region in [bed format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). The first 3 columns of such files are chrom, chromStart, and chromEnd.
Note that by definition, bed files are 0-based or "left-open, right-closed", which means bed interval "chr1 10 20" only contains 10 bases on chr1, from the 11th to the 20th.
In addition, submitters should also let GDC know the genome build (hg18, hg19 or GRCh38) of their bed files.
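
The coordinate convention can be checked with a small sketch. The function name and the returned keys are hypothetical; only the first three bed columns described above are read.

```python
def bed_interval_to_one_based(line):
    """Convert one bed line to 1-based, inclusive base positions."""
    chrom, start, end = line.split()[:3]
    start, end = int(start), int(end)
    return {
        "chrom": chrom,
        "first_base": start + 1,  # chromStart is 0-based
        "last_base": end,         # chromEnd is not included in 0-based terms
        "length": end - start,
    }


print(bed_interval_to_one_based("chr1\t10\t20"))
# {'chrom': 'chr1', 'first_base': 11, 'last_base': 20, 'length': 10}
```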
5. Is a Target Capture Kit uniquely defined by its Target Region Bed File?
-Not necessary. Sometimes, users or manufactures may want to augment an existing kit with additional probes, in order to capture more regions or simply improve the quality of existing regions. In the later case, the bed file stays the same, but it is now a different Target Capture Kit and should be registered separately as described in Step 3 above.
+Not necessarily. Sometimes, users or manufacturers may want to augment an existing kit with additional probes, in order to capture more regions or simply improve the quality of existing regions. In the latter case, the bed file stays the same, but it is now a different Target Capture Kit and should be registered separately as described in Step 3 above.
-## Specifying Tumor Normal Pairs for analysis
+## Specifying Tumor Normal Pairs for Analysis
It is critical for many cancer bioinformatics pipelines to specify which normal sample to use to factor out germline variation. In particular, this is a necessary specification for all tumor normal paired variant calling pipelines. The following details describe how the GDC determines which normal sample to use for variant calling.
@@ -150,18 +140,24 @@ It is critical for many cancer bioinformatics pipelines to specify which normal
* If there are multiple normals of the same experimental_strategy for a case:
* Users can specify which normal to use by specifying on the aliquot. To do so one of the following should be set to `TRUE` for the specified experimental strategy: `selected_normal_low_pass_wgs`, `selected_normal_targeted_sequencing`, `selected_normal_wgs`, or `selected_normal_wxs`.
* Or if no normal is specified the GDC will select the best normal for that patient based on the following criteria. This same logic will also be used if multiple normal are selected.
- * If a case has blood cancer we will use sample type in the following priority order: Blood Derived Normal > Bone Marrow Normal > Mononuclear Cells from Bone Marrow Normal > Fibroblasts from Bone Marrow Normal > Lymphoid Normal > Buccal Cell Normal > Solid Tissue Normal > EBV Immortalized Normal
- * If case does not have blood cancer we will use sample type in the following priority order:
- Solid Tissue Normal > Buccal Cell Normal > Lymphoid Normal > Fibroblasts from Bone Marrow Normal > Mononuclear Cells from Bone Marrow Normal > Bone Marrow Normal > Blood Derived Normal > EBV Immortalized Normal
- * If there are still ties we will choose the aliquot submitted first
-* If there are no normals
- * The GDC will not run tumor only variant calling pipeline by default. The submitter must specify one of the following properties as TRUE: `no_matched_normal_low_pass_wgs`, `no_matched_normal_targeted_sequencing`, `no_matched_normal_wgs`, `no_matched_normal_wxs`.
+ * If a case has blood cancer we will use sample type in the following priority order:
+
+ Blood Derived Normal > Bone Marrow Normal > Mononuclear Cells from Bone Marrow Normal > Fibroblasts from Bone Marrow Normal > Lymphoid Normal > Buccal Cell Normal > Solid Tissue Normal > EBV Immortalized Normal
-Note that we will only run variant calling for a particular tumor aliquot per experimental strategy once. You must make sure that the appropriate normal control is uploaded to the GDC when Requesting Submission. Uploading a different normal sample later will not result in reanalysis by the GDC.
+ * If a case does not have blood cancer we will use sample type in the following priority order:
+ Solid Tissue Normal > Buccal Cell Normal > Lymphoid Normal > Fibroblasts from Bone Marrow Normal > Mononuclear Cells from Bone Marrow Normal > Bone Marrow Normal > Blood Derived Normal > EBV Immortalized Normal
+ * If there are still ties, we will choose the aliquot submitted first.
+* If there are no normals.
+ * The GDC will not run tumor only variant calling pipeline by default. The submitter must specify one of the following properties as TRUE: `no_matched_normal_low_pass_wgs`, `no_matched_normal_targeted_sequencing`, `no_matched_normal_wgs`, `no_matched_normal_wxs`.
+Note that we will only run variant calling for a particular tumor aliquot per experimental strategy once. You must make sure that the appropriate normal control is uploaded to the GDC when Requesting Submission. Uploading a different normal sample later will not result in reanalysis by the GDC.
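
The fallback selection described above (no normal specified, or several selected) can be sketched as a sort by priority and then by submission order. This is an illustration only and not the GDC's implementation; the dictionary keys `sample_type` and `submission_order` are hypothetical.

```python
BLOOD_CANCER_PRIORITY = [
    "Blood Derived Normal",
    "Bone Marrow Normal",
    "Mononuclear Cells from Bone Marrow Normal",
    "Fibroblasts from Bone Marrow Normal",
    "Lymphoid Normal",
    "Buccal Cell Normal",
    "Solid Tissue Normal",
    "EBV Immortalized Normal",
]

OTHER_CANCER_PRIORITY = [
    "Solid Tissue Normal",
    "Buccal Cell Normal",
    "Lymphoid Normal",
    "Fibroblasts from Bone Marrow Normal",
    "Mononuclear Cells from Bone Marrow Normal",
    "Bone Marrow Normal",
    "Blood Derived Normal",
    "EBV Immortalized Normal",
]


def select_normal(normals, blood_cancer):
    """Pick the highest-priority normal; ties go to the earliest submission."""
    priority = BLOOD_CANCER_PRIORITY if blood_cancer else OTHER_CANCER_PRIORITY
    return min(
        normals,
        key=lambda n: (priority.index(n["sample_type"]), n["submission_order"]),
    )


candidates = [
    {"id": "A", "sample_type": "Solid Tissue Normal", "submission_order": 2},
    {"id": "B", "sample_type": "Blood Derived Normal", "submission_order": 1},
    {"id": "C", "sample_type": "Solid Tissue Normal", "submission_order": 1},
]
print(select_normal(candidates, blood_cancer=True)["id"])   # B
print(select_normal(candidates, blood_cancer=False)["id"])  # C
```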
## Clinical Data Requirements
For the GDC to release a project there is a minimum number of clinical properties that are required. Minimal cross-project GDC requirements include age, gender, and diagnosis information. Other requirements may be added when the submitter is approved for submission to the GDC.
+
+## miRNA Submission
+
+The GDC requires that miRNA reads be trimmed before being uploaded to the GDC because miRNA datasets can have different trimming schemas. Uploading untrimmed miRNA reads can delay harmonization until the problem is resolved.
diff --git a/docs/Data_Submission_Portal/Users_Guide/Checklist.md b/docs/Data_Submission_Portal/Users_Guide/Checklist.md
new file mode 100644
index 000000000..d984f32c7
--- /dev/null
+++ b/docs/Data_Submission_Portal/Users_Guide/Checklist.md
@@ -0,0 +1,39 @@
+# Before Submitting Data to the GDC Portal
+
+## Overview
+The National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Submission Portal User's Guide is the companion documentation for the [GDC Data Submission Portal](https://gdc.cancer.gov/submit-data/gdc-data-submission-portal) and provides detailed information and instructions for its use.
+
+## Steps to Submit Data to the GDC
+The following tasks are required to submit data to the [GDC Data Submission Portal](https://portal.gdc.cancer.gov/).
+
+1. Complete the GDC Data [Submission Request Form](https://gdc.cancer.gov/data-submission-request-form). After submission, the request will be reviewed by the GDC Data Submission Review Committee. During this time, create an [eRA Commons account](https://era.nih.gov/registration_accounts.cfm) if you do not already have one.
+
+2. If the study is approved, contact a [Genomic Program Administrator (GPA)](https://osp.od.nih.gov/genomic-program-administrators/) to register the approved study in [dbGaP](https://www.ncbi.nlm.nih.gov/sra/docs/submitdbgap). This includes registering the project as a GDC Trusted Partner study, registering cases, and adding authorized data submitters. For more information, see the [Data Submission Process](https://gdc.cancer.gov/submit-data/data-submission-processes-and-tools).
+
+3. Contact GDC User Services to create a submission project. The User Services team will require a project ID, which is a two-part identifier, where the first portion is the __Program__ followed by a hyphen (__-__) and the second portion is the __Project__. This must be alphanumeric and all caps only. An example would be `TCGA-BRCA`. You must also create a project name, which can be longer and has fewer requirements on length or character usage. An example would be `Breast Invasive Carcinoma`.
+
+## Key Features
+The GDC Data Submission Portal is a platform that allows researchers to submit and release data to the GDC. The key features of the GDC Data Submission Portal are:
+
+* __Upload and Validate Data__: Project data can be uploaded to the GDC project workspace. The GDC will validate the data against the [GDC Data Dictionary](../../Data_Dictionary/viewer.md).
+* __Browse Data__: Data that has been uploaded to the project workspace can be browsed to ensure that the project is ready for processing.
+* __Download Data__: Data that has been uploaded into the project workspace can be downloaded for review or update by using the [API](https://docs.gdc.cancer.gov/API/Users_Guide/Downloading_Files/) or the [Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool).
+* __Review and Submit Data__: Prior to submission, data can be reviewed to check for accuracy and completeness. Once the review is complete, the data can be submitted to the GDC for processing through [Data Harmonization](https://gdc.cancer.gov/submit-data/gdc-data-harmonization).
+* __Release Data__: After harmonization, data can be released to the research community for access through the [GDC Data Portal](https://portal.gdc.cancer.gov/) and other [GDC Data Access Tools](https://gdc.cancer.gov/access-data/data-access-processes-and-tools).
+* __Status and Alerts__: Visual cues are implemented in the GDC Data Submission Portal Dashboard to easily identify incomplete submissions via panel displays summarizing submitted data and associated data elements.
+* __Transactions__: A list of all actions performed in a project is provided with detailed information for each action.
+
+## Sections to the Data Submission Portal Guide
+
+* [__Data Submission Overview__](Data_Submission_Overview.md): Graphical explanations used to display the life cycle of projects and their data.
+* [__Data Submission Process__](Data_Submission_Process.md): An overview of the data submission process using the GDC Data Submission Portal.
+* [__Data Submission Walkthrough__](Data_Submission_Walkthrough.md): Step-by-step instructions on GDC data submission and their relationship to the GDC Data Model.
+* [__Pre-Release Data Portal__](Pre_Release_QC.md): Instructions on how to use the Pre-Release Data Portal for projects that have been harmonized but not released.
+
+## HIPAA Compliance
+
+To maintain HIPAA compliance, the GDC will not accept any data for patients aged 90 and over, including any follow-up events that occur after a patient turns 90. To comply with this requirement, data submitters may either omit any data (entire cases or specific nodes) that would violate this rule or obfuscate the associated dates. Please see the [Date Obfuscation](/Data_Submission_Portal/Users_Guide/Best_Practices/#date-obfuscation) section for more information.
+
+## Release Notes
+
+The [Release Notes](../../Data_Submission_Portal/Release_Notes/Data_Submission_Portal_Release_Notes.md) section of this User's Guide contains details about new features, bug fixes, and known issues.
diff --git a/docs/Data_Submission_Portal/Users_Guide/Data_Submission_Overview.md b/docs/Data_Submission_Portal/Users_Guide/Data_Submission_Overview.md
new file mode 100644
index 000000000..1af5cf970
--- /dev/null
+++ b/docs/Data_Submission_Portal/Users_Guide/Data_Submission_Overview.md
@@ -0,0 +1,111 @@
+# Data Submission Overview
+
+## Overview
+This section will walk users through two parts of the submission process. The first portion will be the steps taken by the users to go through the submission process from start to finish. The second portion will describe the lifecycle of a project and a file throughout the data submission process.
+
+## GDC Data Submission Workflow
+
+The diagram below illustrates the process from uploading through releasing data in the GDC Data Submission Portal. To review the steps needed before beginning submission, see [Before Submitting Data to the GDC Portal](https://docs.gdc.cancer.gov/Data_Submission_Portal/Users_Guide/Checklist/).
+
+[![GDC Data Submission Portal Workflow Upload](images/GDC_Data_Submission_Workflow-updated_20190301.jpg)](images/GDC_Data_Submission_Workflow-updated_20190301.jpg "Click to see the full image.")
+
+### Review GDC Dictionary and GDC Data Model - Submitter Activity
+
+It is suggested that all submitters review the [GDC Dictionary](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/) and [GDC Data Model](https://gdc.cancer.gov/developers/gdc-data-model/gdc-data-model-components). It is beneficial for submitters to know which nodes will need metadata submission, how these nodes relate to each other, and what information is required for each node in the model.
+
+### Download Templates - Submitter Activity
+
+After determining the required nodes for the submission, go to each node page in the [GDC Dictionary](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/). There will be a "Download Template" drop down list. Select the file format, either TSV or JSON, and download the template for the node. If [numerous entries](Data_Submission_Walkthrough.md#submitting-numerous-cases) are being submitted all at one time, it is suggested that the user uses a TSV template. At this point, it is suggested to go through the template and remove fields that will not be populated by the metadata submission, but make sure to complete all fields that are required for the node. For more information about the Data Dictionary, please visit [here](../../../Data_Dictionary/).
+
+[`See GDC Data Dictionary here.`](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/)
+
+### Upload Case Information Including dbGaP Submitted Subject IDs - Submitter Activity
+
+After registering the study in [dbGaP](https://gdc.cancer.gov/submit-data/obtaining-access-submit-data), the first node to be created in the data model is the [`case` node](Data_Submission_Walkthrough.md#case-submission). The `case` node is important as it will contain a unique `submitter_id` that is registered in dbGaP under a particular project. This will connect the two databases, dbGaP and GDC, and allows for access to be granted to a controlled data set based on the study and its cases.
+
+To [submit the `case`](Data_Submission_Walkthrough.md#uploading-the-case-submission-file) nodes, a user must be able to [login](Data_Submission_Process.md#authentication) and access the [GDC Submission Portal](https://portal.gdc.cancer.gov/submission/) for their respective project. Metadata for all nodes are uploaded via the [API](https://docs.gdc.cancer.gov/API/Users_Guide/Submission/#creating-and-updating-entities) or through the [Submission Portal](Data_Submission_Walkthrough.md#upload-using-the-gdc-data-submission-portal).
+
+[`See case example here.`](Data_Submission_Walkthrough.md#case-submission)
+
+[`See metadata upload example here.`](Data_Submission_Walkthrough.md#upload-using-the-gdc-data-submission-portal)
+
+### Upload Clinical and Biospecimen Data - Submitter Activity
+
+With the creation of `case` nodes, other nodes in the [data model](https://gdc.cancer.gov/developers/gdc-data-model/gdc-data-model-components) can be [uploaded](Data_Submission_Walkthrough.md#upload-using-the-gdc-data-submission-portal). This includes the [Clinical](Data_Submission_Walkthrough.md#clinical-data-submission) and [Biospecimen](Data_Submission_Walkthrough.md#biospecimen-submission) nodes, with examples for each that can be found in the [Data Upload Walkthrough](Data_Submission_Walkthrough.md).
+
+[`See clinical example here.`](Data_Submission_Walkthrough.md#clinical-data-submission)
+
+[`See biospecimen example here.`](Data_Submission_Walkthrough.md#biospecimen-submission)
+
+[`See metadata upload example here.`](Data_Submission_Walkthrough.md#upload-using-the-gdc-data-submission-portal)
+
+### Register Data Files - Submitter Activity
+
+Registering data files is necessary before they can be uploaded. This allows the GDC to later validate the uploads against the user-supplied md5sum and file size. The [submission](Data_Submission_Walkthrough.md#experiment-data-submission) of these files can range from clinical and biospecimen supplements to `submitted_aligned_reads` and `submitted_unaligned_reads`.
+
+[`See experiment data example here.`](Data_Submission_Walkthrough.md#experiment-data-submission)
+
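
Because registration relies on a user-supplied md5sum and file size, it can help to compute both locally before filling in the metadata. The sketch below uses only the Python standard library; the function name is hypothetical, and the file is read in chunks so that large BAM or FASTQ files do not need to fit in memory.

```python
import hashlib
import os


def md5sum_and_size(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest and the size in bytes of a file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


# Hypothetical usage:
# checksum, size = md5sum_and_size("Alignment.bam")
```
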
+### Upload Data Using Data Transfer Tool - Submitter Activity
+
+Before uploading the submittable data files to the GDC, a user will need to determine if the correct nodes have been created and the information within them is correct. This is accomplished using the [Browse](Data_Submission_Process.md#browse) page in the [Data Submission Portal](https://portal.gdc.cancer.gov/submission). Here you can find the metadata and file_state, which must have progressed to `registered` for an associated file to be uploaded. You can find more about the file life cycle [here](#file-lifecycle).
+
+Once the submitter has verified that the submittable data files have been registered, the user can obtain the submission manifest file that is found on the [Project Overview](Data_Submission_Process.md#project-overview) page. From this point the submission process is described in the ["Uploading the Submittable Data File to the GDC"](Data_Submission_Walkthrough.md#uploading-the-submittable-data-file-to-the-gdc) section.
+
+For strategies on data upload, further documentation for the GDC Data Submission process is detailed on the [Data Submission Processes and Tools](https://gdc.cancer.gov/submit-data/data-submission-processes-and-tools) section of the GDC Website.
+
+[`See submittable data file upload example here.`](Data_Submission_Walkthrough.md#uploading-the-submittable-data-file-to-the-gdc)
+
+### Verify Accuracy and Completeness of Project Data - Submitter Activity
+
+The submitter is responsible for reviewing the data uploaded to the project workspace (see [Data Submission Walkthrough](Data_Submission_Walkthrough.md)) and ensuring that it is ready for processing by the GDC [Harmonization Process](https://gdc.cancer.gov/submit-data/gdc-data-harmonization). A user should be able to go through the [Pre-Harmonization Checklist](Data_Submission_Process.md#pre-harmonization-checklist) and verify that their submission meets these criteria.
+
+[`See pre-harmonization checklist here.`](Data_Submission_Process.md#pre-harmonization-checklist)
+
+### Request Data Harmonization - Submitter Activity
+
+When the project is complete and ready for processing, the submitter will [request harmonization](Data_Submission_Process.md#submit-your-workspace-data-to-the-gdc). If the project is not ready for processing, the project can be re-opened and the submitter will be able to upload more data to the project workspace.
+
+[`See harmonization request example here.`](Data_Submission_Process.md#submit-your-workspace-data-to-the-gdc)
+
+> __NOTE:__ The GDC requests that users submit their data to the GDC within six months from the first upload of data to the project workspace.
+
+### GDC Review/QC Submitted Data - GDC Activity
+
+The Bioinformatics Team at the GDC runs the Quality Control pipeline on the submitted data. This pipeline mirrors the [Pre-Harmonization Checklist](Data_Submission_Process.md#pre-harmonization-checklist) and will determine if the submission is complete and is ready for the Harmonization pipeline. If the submission does contain problems, the GDC will contact the user to "Re-Open" the project and fix the errors in their submission.
+
+Once the review is complete, all validated nodes will be changed to state "submitted". At this point users can submit more files to a project, but they will be considered as a different batch for harmonization.
+
+### GDC Harmonize Data - GDC Activity
+
+After the submission passes the GDC Quality Control pipeline, it will be queued for the [GDC Harmonization pipeline](https://gdc.cancer.gov/about-data/gdc-data-harmonization).
+
+### Submitter Review/QC of Harmonized Data - Submitter Activity
+
+After the data is processed in the Harmonization pipeline, the GDC asks submitters to [verify the quality](https://portal.gdc.cancer.gov/submission/login?next=%2Fsubmission%2F) of their harmonized data. It is the user's responsibility to notify the GDC of any errors in their harmonized data sets. The GDC will then work with the user to correct the issue and rerun the Harmonization pipeline if needed.
+
+### Release Data Within Six Months - Submitter Activity
+
+Project release occurs after the data has been harmonized, and allows users to access this data with the [GDC Data Portal](https://portal.gdc.cancer.gov/) and other [GDC Data Access Tools](https://gdc.cancer.gov/access-data/data-access-processes-and-tools). The GDC will release data according to [GDC Data Sharing Policies](https://gdc.cancer.gov/submit-data/data-submission-policies). Data must be released within six months after GDC data processing has been completed, or the submitter may request earlier release.
+
+[`See release example here.`](Data_Submission_Process.md#release)
+
+>__Note__: Released cases and/or files can be redacted from the GDC. For more information, visit the [GDC Policies page (under GDC Data Sharing Policies)](https://gdc.cancer.gov/submit-data/data-submission-policies).
+
+### GDC Releases Data - GDC Activity
+
+GDC data releases are not continuous; instead, data is released in discrete updates. Once the harmonized data and the release request are approved, the data will be available in an upcoming GDC Data Release.
+
+## Project and File Lifecycles
+
+### Project Lifecycle
+The diagram of the project lifecycle below demonstrates the transition of a project through the various states. Initially the project is open for data upload and validation. Any changes to the data must be made while the project status is open. When the data is uploaded and ready for review, the submitter changes the project state to review. During the review state, the project is locked and additional data cannot be uploaded. If data changes are needed during the review period, the project has to be re-opened.
+
+The process of Harmonization does not occur immediately after submitted files are uploaded. After the submission is complete and all the necessary data and files have been uploaded, the user submits the data to the GDC for processing through the [GDC Data Harmonization Pipelines](https://gdc.cancer.gov/submit-data/gdc-data-harmonization) and the project state changes to submitted. When the data has been processed, the project state changes back to open for new data to be submitted to the project and the submitter can review the processed data. After review of the processed data, the submitter can then release the harmonized data to the [GDC Data Portal](https://portal.gdc.cancer.gov/) and other [GDC Data Access Tools](https://gdc.cancer.gov/access-data/data-access-processes-and-tools) according to [GDC Data Sharing Policies](https://gdc.cancer.gov/submit-data/data-submission-policies).
+
+[![GDC Data Submission Portal Workflow](images/Submission.png)](images/Submission.png "Click to see the full image.")
+
+### File Lifecycle
+
+This section describes states pertaining to submittable data files throughout the data submission process. A submittable data file could contain data such as genomic sequences (such as a BAM or FASTQ) or pathology slide images. The file lifecycle starts when a submitter uploads metadata for a file to the [GDC Data Submission Portal](https://portal.gdc.cancer.gov/submission/). This metadata file registers a description of the file as an entity on the GDC. The status of this entity is known as "state" and is represented by __purple__ circles. The submitter can then use the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool) to upload the actual file, whose status is represented by __red__ circles. Throughout the lifecycle, the file and its associated entity transition through various states from when they are initially registered through file submission and processing. The diagram below details these state transitions.
+
+[![GDC Data Submission Portal File Status](images/gdc-submission-portal-file-state-vs-state.png)](images/gdc-submission-portal-file-state-vs-state.png "Click to see the full image.")
diff --git a/docs/Data_Submission_Portal/Users_Guide/Data_Submission_Process.md b/docs/Data_Submission_Portal/Users_Guide/Data_Submission_Process.md
new file mode 100644
index 000000000..dda762ceb
--- /dev/null
+++ b/docs/Data_Submission_Portal/Users_Guide/Data_Submission_Process.md
@@ -0,0 +1,330 @@
+# Data Submission Portal
+
+## Overview
+
+This section will walk users through the submission process using the [GDC Data Submission Portal](https://portal.gdc.cancer.gov/submission/) to upload files to the GDC.
+
+## Authentication
+
+### Requirements
+
+Accessing the GDC Data Submission Portal requires eRA Commons credentials with appropriate dbGaP authorization. To learn more about obtaining the required credentials and authorization, see [Obtaining Access to Submit Data]( https://gdc.cancer.gov/submit-data/obtaining-access-submit-data).
+
+### Authentication via eRA Commons
+
+Users can log into the GDC Data Submission Portal with eRA Commons credentials by clicking the "Login" button. If authentication is successful, the user will be redirected to the GDC Data Submission Portal front page and the user's eRA Commons username will be displayed in the upper right corner of the screen.
+
+#### GDC Authentication Tokens
+
+The GDC Data Portal provides authentication tokens for use with the GDC Data Transfer Tool or the GDC API. To download a token:
+
+1. Log into the GDC using your eRA Commons credentials.
+2. Click the username in the top right corner of the screen.
+3. Select the "Download Token" option.
+
+![Token Download Button](images/gdc-data-portal-token-download.png)
+
+A new token is generated each time the `Download Token` button is clicked.
+
+For more information about authentication tokens, see [Data Security](../../Data/Data_Security/Data_Security.md#authentication-tokens).
+
+>**NOTE:** The authentication token should be kept in a secure location, as it allows access to all data accessible by the associated user account.
+
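
As a sketch of how a downloaded token might be used with the GDC API, the helper below reads the token from a file and attaches it to a request. It assumes the API accepts the token in an `X-Auth-Token` request header; the function name and the paths in the usage comment are hypothetical. The request is only constructed here and is not sent.

```python
import urllib.request


def build_authenticated_request(url, token_path):
    """Build a request that carries the token in the X-Auth-Token header."""
    with open(token_path, encoding="utf-8") as handle:
        token = handle.read().strip()
    return urllib.request.Request(url, headers={"X-Auth-Token": token})


# Hypothetical usage; keep the token file outside of version control and
# restrict its file permissions.
# request = build_authenticated_request(
#     "https://api.gdc.cancer.gov/submission", "gdc-user-token.txt"
# )
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```
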
+#### Logging Out
+
+To log out of the GDC, click the username in the top right corner of the screen, and select the Logout option. Users will automatically be logged out after 15 minutes of inactivity.
+
+![Logout link](images/gdc-data-portal-token-download.png)
+
+## Homepage
+
+After authentication, users are redirected to a homepage. The homepage acts as the entry point for GDC data submission and provides submitters with access to a list of authorized projects, reports, and transactions. Content on the homepage varies based on the user profile (e.g. submitter, program office).
+
+[![GDC Submitter Home Page](images/GDC-HomePage-Submit_v2.png)](images/GDC-HomePage-Submit_v2.png "Click to see the full image.")
+
+### Reports
+
+Project summary reports can be downloaded at the Submission Portal homepage at three different levels: `CASE OVERVIEW`, `ALIQUOT OVERVIEW`, and `DATA VALIDATION`. Each report is generated in tab-delimited format in which each row represents an active project.
+
+* __`CASE OVERVIEW`:__ This report describes the number of cases with associated biospecimen data, clinical data, or submittable data files (broken down by data type) for each project.
+* __`ALIQUOT OVERVIEW`:__ This report describes the number of aliquots in a project with associated data files. Aliquot numbers are broken down by sample tissue type.
+* __`DATA VALIDATION`:__ This report categorizes all submittable data files associated with a project by their file status.
+
+### Projects
+
+The projects section in the homepage lists the projects that the user has access to along with basic information about each project. For users with access to a large number of projects, this table can be filtered using the 'FILTER PROJECTS' field. Selecting a project ID will direct the user to the project's [Dashboard](#dashboard). The button used to release data for each project is also located on this screen, see [Release](#release) for details.
+
+## Dashboard
+
+The GDC Data Submission Portal dashboard provides details about a specific project.
+
+[![GDC Submission Dashboard Page](images/GDC_Submission_Dashboard_4.png)](images/GDC_Submission_Dashboard_4.png "Click to see the full image.")
+
+The dashboard contains various visual elements to guide the user through all stages of submission, from viewing the [Data Dictionary](https://docs.gdc.cancer.gov/Data_Dictionary/) and uploading data to submitting a project for harmonization.
+
+To better understand the information displayed on the dashboard and the available actions, please refer to the [Data Submission Walkthrough](Data_Submission_Walkthrough.md).
+
+### Project Overview
+The Project Overview section of the dashboard displays the most current project state (open / review / submitted / processing) and the GDC Release, which is the date on which the project was released to the GDC.
+
+The search field at the top of the dashboard allows for submitted entities to be searched by partial or whole `submitter_id`. When a search term is entered into the field, a list of entities matching the term is updated in real time. Selecting one of these entities links to its details in the [Browse Tab](#browse).
+
+The remaining part of the top section of the dashboard is broken down into four status charts:
+
+* __Cases with Clinical__: The number of `cases` for which Clinical data has been uploaded.
+* __Cases with Biospecimen__: The number of `cases` for which Biospecimen data has been uploaded.
+* __Cases with Submittable Data Files__: The number of `cases` for which experimental data has been uploaded.
+* __Submittable Data Files__: The number of files uploaded through the GDC Data Transfer Tool. For more information on this status chart, please refer to [File Lifecycle](Data_Submission_Overview.md#file-lifecycle).
+ * __`DOWNLOAD MANIFEST`:__ This button below the status chart allows the user to download a manifest for registered files in this project that have not yet been uploaded.
+
+### Action Panels
+
+There are two action panels available below the Project Overview.
+
+* [UPLOAD DATA TO YOUR WORKSPACE](Data_Submission_Walkthrough.md): Allows a submitter to upload project data to the GDC project workspace. The GDC will validate the uploaded data against the [GDC Data Dictionary](https://docs.gdc.cancer.gov/Data_Dictionary/). This panel also contains a table that displays details about the five latest transactions. Clicking the IDs in the first column will bring up a window with details about the transaction, which are documented in the [transactions](#transactions) page. This panel will also allow the user to commit file uploads to the project.
+* [REVIEW AND SUBMIT YOUR WORKSPACE DATA TO THE GDC](#submit-your-workspace-data-to-the-gdc): Allows a submitter to review project data. Starting a review locks the project so that additional data cannot be uploaded while it is in review. Once the review is complete, the data can be submitted to the GDC for processing through the [GDC Harmonization Process](https://gdc.cancer.gov/submit-data/gdc-data-harmonization).
+
+These actions and associated features are further detailed in their respective sections of the documentation.
+
+## Transactions
+
+The transactions page lists all of the project's transactions. The transactions page can be accessed by choosing the Transactions tab at the top of the dashboard or by choosing "View All Data Upload Transactions" in the first panel of the dashboard.
+
+[![GDC Submission Transactions](images/GDC_Submission_Transactions_2.png)](images/GDC_Submission_Transactions_2.png "Click to see the full image.")
+
+The types of transactions are the following:
+
+* __Upload:__ The user uploads data to the project workspace. Note that submittable data files uploaded using the GDC Data Transfer Tool do not appear as transactions. Uploaded submittable data can be viewed in the Browse tab.
+* __Delete:__ The user deletes data from the project workspace.
+* __Review:__ The user reviews the project before submitting data to the GDC.
+* __Open:__ The user re-opens the project if it was under review. This allows the upload of new data to the project workspace.
+* __Submit:__ The user submits uploaded data to the GDC. This triggers the data harmonization process.
+* __Release:__ The user releases harmonized data to be available through the GDC Data Portal and other GDC data access tools.
+
+### Transactions List View
+
+The transactions list view displays the following information:
+
+|Column|Description|
+| --- | --- |
+| __ID__ | Identifier of the transaction |
+| __Type__ | Type of the transaction (see the list of transaction types in the previous section)|
+| __Step__ | The step of the submission process that each file is currently in. This can be Validate or Commit. "Validate" represents files that have not yet been committed but have been uploaded using the submission portal or the API. |
+| __DateTime__ | Date and Time that the transaction was initiated |
+| __User__ | The username of the submitter that performed the transaction |
+| __State__ | Indicates the status of the transaction: `SUCCEEDED`, `PENDING`, or `FAILED` |
+| __Commit/Discard__ | Two buttons appear when data has been uploaded using the API or the submission portal. These allow validated data to be incorporated into the project or discarded. This column will then display the transaction number for committed uploads and "Discarded" for uploads that are discarded.|
+
+### Transaction Filters
+
+Choosing from the drop-down menu at the top of the table allows the transactions to be filtered by those that are in progress, to be committed, succeeded, failed, or discarded. The drop-down menu also allows for the transactions to be filtered by type and step.
+
+### Transactions Details
+
+Clicking on a transaction will open the details panel. Data in this panel is organized into multiple sections including actions, details, types, and documents as described below.
+
+[![GDC Submission Transactions](images/GDC_Submission_Transactions_Details_3.png)](images/GDC_Submission_Transactions_Details_3.png "Click to see the full image.")
+
+Navigation between the sections can be performed by either scrolling down or by clicking on the section icon displayed on the left side of the details panel.
+
+#### Actions
+
+The Actions section allows a user to perform an action for transactions that provide actions. For example, if a user uploads read groups and file metadata, a corresponding manifest file will be available for download from the transaction. This manifest is used to upload the actual files through the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool).
+
+[![GDC Submission Transactions Details Action](images/GDC_Submission_Transactions_Details_Action_2.png)](images/GDC_Submission_Transactions_Details_Action_2.png "Click to see the full image.")
+
+#### Details
+
+The Details section provides details about the transaction itself, such as its project, type, and number of affected cases.
+
+[![GDC Submission Transactions Details](images/GDC_Submission_Transactions_Details_Details_2.png)](images/GDC_Submission_Transactions_Details_Details_2.png "Click to see the full image.")
+
+#### Types
+
+The Types section lists the type of files submitted and the number of affected cases and entities.
+
+[![GDC Submission Transactions Types](images/GDC_Submission_Transactions_Details_Types_2.png)](images/GDC_Submission_Transactions_Details_Types_2.png "Click to see the full image.")
+
+#### Documents
+
+The Documents section lists the files submitted during the transaction.
+The user can download the original files from the transaction, a report detailing the transaction, or, for a failed transaction, the errors that it produced.
+
+[![GDC Submission Transactions Documents](images/GDC_Submission_Transactions_Details_Documents_2.png)](images/GDC_Submission_Transactions_Details_Documents_2.png "Click to see the full image.")
+
+## Browse
+
+The `Browse` menu provides access to all of a project's content. Most content is driven by the GDC Data Dictionary and the interface is dynamically generated to accommodate the content.
+
+Please refer to the [GDC Data Dictionary Viewer](../../Data_Dictionary/viewer.md) for specific details about dictionary-generated fields, columns, and filters.
+
+[![GDC Submission Cases Default View](images/GDC_Submission_Cases_Default_2.png)](images/GDC_Submission_Cases_Default_2.png "Click to see the full image.")
+
+### Main Interface Elements
+
+#### Filters
+
+A wide set of filters is available for the user to select the type of entity to be displayed. These filters are dynamically created based on the [GDC Data Dictionary](../../Data_Dictionary/index.md).
+
+Current filters are:
+
+|Filter|Description|
+| --- | --- |
+| __Cases__ | Display all `Cases` associated with the project. |
+| __Clinical__ | Display all Clinical data uploaded to the project workspace. This is divided into subgroups including `Demographics`, `Diagnoses`, `Exposures`, `Family Histories`, `Follow_up`, `Molecular_tests`, and `Treatments`. |
+| __Biospecimen__ | Display all Biospecimen data uploaded to the project workspace. This is divided into subgroups including `Samples`, `Portions`, `Slides`, `Analytes`, `Aliquots`, and `Read Groups`. |
+| __Submittable Data Files__ | Displays all data files that have been registered with the project. This includes files that have been uploaded and those that have been registered but not uploaded yet. This category is divided into groups by file type. |
+| __Annotations__ | Lists all annotations associated with the project. An annotation provides an explanatory comment associated with data in the project. |
+| __Harmonized Data Files__ | Lists all data files that have been harmonized by the GDC. This category is divided into groups by generated data. |
+
+#### List View
+
+The list view is a paginated list of all entities corresponding to the selected filter.
+
+On the top-right section of the screen, the user can download data about all entities associated with the selected filter.
+
+* For the case filter, it will download all Clinical data or all Metadata.
+* For all other filters, it will download the corresponding metadata (e.g., for the `demographic` filter, it will download all `demographic` data).
+
+[![GDC Submission Case Summary Download](images/GDC_Submission_Cases_Summary_Download_4.png)](images/GDC_Submission_Cases_Summary_Download_2.png "Click to see the full image.")
+
+#### Details Panel
+
+Clicking on an entity will open the details panel. Data in this panel is broken down into multiple sections depending on the entity type. The main sections are:
+
+* __Actions__: Actions that can be performed relating to the entity. This includes downloading the metadata (JSON or TSV) or submittable data file pertaining to the entity and deleting the entity. See the [Deleting Entities](Data_Submission_Walkthrough.md#deleting-submitted-entities) guide for more information.
+* __Summary__: A list of IDs and system properties associated with the entity.
+* __Details__: Properties of the entity (not associated with cases).
+* __Hierarchy__ or __Related Entities__: A list of associated entities.
+* __Annotations__: A list of annotations associated with the entity.
+* __Transactions__: A list of previous transactions that affect the entity.
+
+[![GDC Submission Case Details](images/GDC_Submission_Cases_Details_2.png)](images/GDC_Submission_Cases_Details_2.png "Click to see the full image.")
+
+The sections listed above can be navigated either by scrolling down or by clicking on the section icon on the left side of the details panel.
+
+#### Related Entities
+
+The Related Entities table lists all entities, grouped by type, related to the selected `case`. This section is only available at the `case` level.
+
+[![GDC Submission Cases Related Entities](images/GDC_Submission_Cases_Summary_Related_Entities_2.png)](images/GDC_Submission_Cases_Summary_Related_Entities_2.png "Click to see the full image.")
+
+
+This table contains the following columns:
+
+* __Category__: category of the entity (Clinical, Biospecimen, submittable data file).
+* __Type__: type of entity (based on Data Dictionary).
+* __Count__: number of occurrences of an entity associated with the `case`. Clicking on the count will open a window listing those entities within the Browse page.
+
+#### Hierarchy
+
+The hierarchy section is available for entities at any level (e.g., Clinical, Biospecimen, etc.), except for `case`. The user can use the hierarchy section to navigate through entities.
+
+The hierarchy shows:
+
+* The `case` associated with the entity.
+* The __direct__ parents of the entity.
+* The __direct__ children of the entity.
+
+[![GDC Submission Cases Details Hierarchy](images/GDC_Submission_Cases_Summary_Hierarchy_2.png)](images/GDC_Submission_Cases_Summary_Hierarchy_2.png "Click to see the full image.")
+
+After data is uploaded to the workspace on the GDC Data Submission Portal, it must be [reviewed by the submitter](#pre-harmonization-checklist) and then [submitted to the GDC](#submit-to-the-gdc-for-harmonization) for processing.
+
+## Submit Your Workspace Data to the GDC
+
+The GDC Data Submission process is detailed in the [Data Submission Processes and Tools](https://gdc.cancer.gov/submit-data/data-submission-processes-and-tools) section of the GDC Website.
+
+### Review
+
+The submitter is responsible for reviewing the data uploaded to the project workspace (see [Data Submission Walkthrough](Data_Submission_Walkthrough.md)), and ensuring that it is ready for processing by the GDC [Harmonization Process](https://gdc.cancer.gov/submit-data/gdc-data-harmonization).
+
+The user will be able to view the section below on the dashboard. The `REVIEW` button is available only if the project is in "OPEN" state.
+
+[![GDC Submission Review Tab](images/GDC_Submission_Submit_Release_Review_tab_2_v2.png)](images/GDC_Submission_Submit_Release_Review_tab_2_v2.png "Click to see the full image.")
+
+Setting the project to the "REVIEW" state will lock the project and prevent users from uploading additional data. During this period, the submitter can browse the data in the Data Submission Portal or download it. Once the review is complete, the user can request to submit data to the GDC.
+
+Once the user clicks on `REVIEW`, the project state will change to "REVIEW":
+
+[![GDC Submission Review State](images/GDC_Submission_Submit_Release_Project_State_Review_3.png)](images/GDC_Submission_Submit_Release_Project_State_Review_3.png "Click to see the full image.")
+
+### Pre-Harmonization Checklist
+
+The Harmonization step is __NOT__ an automatic process that occurs when data is uploaded to the GDC. The GDC performs batch processing of submitted data for Harmonization only after verifying that the submission is complete.
+
+The following tests must pass before the data can be considered complete:
+
+1. All files that are registered have been uploaded and validated.
+
+2. There are no invalid characters in the `submitter_id` of any node.
+The acceptable characters are alphanumeric characters [a-z, A-Z, 0-9] and `_`, `.`, `-`. Any other characters will interfere with the Harmonization workflow.
+
+3. There are no data files with duplicate md5sums.
+
+4. Clinical data nodes, such as `demographic`, `diagnosis`, and `clinical_supplement`, are linked to `case`.
+
+5. The `read_group` node is linked to a valid node:
+ * `submitted_unaligned_reads`
+ * `submitted_aligned_reads`
+ * `submitted_genomic_profile`
+
+6. The `sample`-`analyte`-`aliquot` relationships are valid. Common problems include:
+ * `aliquot` attached to `sample` nodes of more than one type.
+ * `aliquot` attached to more than one `sample` node (potentially valid but unusual).
+
+7. Each `aliquot` node is associated with only one `submitted_aligned_reads` file of the same `experimental_strategy`.
+
+8. The information for the `platform` is in the `read_group` node. Although the following additional information about the platform is not required, it is beneficial to provide it:
+ * `multiplex_barcode`
+ * `flow_cell_barcode`
+ * `lane_number`
+
+9. In `read_group`, the `library_strategy` should match the `library_selection`:
+ * Targeted Sequencing must be with either PCR or Hybrid Selection.
+ * WXS must be with Hybrid Selection.
+ * WGS must be with Random.
+
+10. The `target_capture_kit` property is completed when the selected `library_strategy` is `WXS`. Errors will occur if `Not Applicable` or `Unknown` is selected.
+
+11. Check the nodes that are related to FASTQ files. For the `submitted_unaligned_reads` node, confirm that the file size is correct, the files are not archived or compressed (`.tar` or `.tar.gz`), and there is a link to `read_group`. For the `read_group` node, make sure that `is_paired_end` is set to `true` for paired-end sequencing and `false` for single-end sequencing.
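+
+Several of the checks above can be scripted before requesting harmonization. The following is a minimal sketch, assuming a POSIX shell. The function names are hypothetical helpers and are not part of any GDC tool, and `duplicate_md5sums` assumes tab-delimited input with the md5sum in the first column.
+
+```shell
+# Hypothetical helper: succeeds only if the submitter_id is non-empty and uses
+# the characters listed in item 2 (alphanumerics, "_", ".", "-").
+is_valid_submitter_id() {
+  case "$1" in
+    ""|*[!A-Za-z0-9_.-]*) return 1 ;;
+    *) return 0 ;;
+  esac
+}
+
+# Hypothetical helper: reads "md5sum<TAB>file_name" lines on stdin and prints
+# every md5sum that occurs more than once (item 3).
+duplicate_md5sums() {
+  cut -f1 | sort | uniq -d
+}
+
+# Hypothetical helper: succeeds if library_strategy ($1) and library_selection
+# ($2) are paired as described in item 9. Strategies that the checklist does
+# not mention are accepted as-is.
+strategy_matches_selection() {
+  case "$1|$2" in
+    "Targeted Sequencing|PCR"|"Targeted Sequencing|Hybrid Selection") return 0 ;;
+    "WXS|Hybrid Selection") return 0 ;;
+    "WGS|Random") return 0 ;;
+    "Targeted Sequencing|"*|"WXS|"*|"WGS|"*) return 1 ;;
+    *) return 0 ;;
+  esac
+}
+```
+
+For example, `is_valid_submitter_id "PROJECT-INTERNAL-000055"` succeeds, while an identifier that contains a space or `#` fails.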
+
+Once complete, clicking the `REQUEST HARMONIZATION` button will indicate to the GDC Team and pipeline automation system that data processing can begin.
+
+### Submit to the GDC for Harmonization
+
+When the project is ready for processing, the submitter will request to submit data to the GDC for Harmonization. If the project is not ready for processing, the project can be re-opened. Then the submitter will be able to upload more data to the project workspace.
+
+The `REQUEST HARMONIZATION` button is available only if the project is in "REVIEW" state. At this point, the user can decide whether to re-open the project to upload more data or to request harmonization of the data to the GDC. When the project is in "REVIEW" the following panel appears on the dashboard:
+
+[![GDC Submission Submit Tab](images/GDC_Submission_Submit_Release_Submit_tab_2_v4.png)](images/GDC_Submission_Submit_Release_Submit_tab_2_v4.png "Click to see the full image.")
+
+Once the user submits data to the GDC, they cannot modify the submitted nodes and files while harmonization is underway. Additional project data can be added during this period and will be considered a separate batch. To process an additional batch, the user must again review the data and select `REQUEST HARMONIZATION`.
+
+[![GDC Submission Submission Tab](images/GDC_SUBMIT_TO_GDC_v3.png)](images/GDC_SUBMIT_TO_GDC_v3.png "Click to see the full image.")
+
+When the user clicks on the action `REQUEST HARMONIZATION` on the dashboard, the following popup is displayed:
+
+[![GDC Submission Submit Popup](images/GDC_Submission_Submit_Release_Submit_Popup_v2.png)](images/GDC_Submission_Submit_Release_Submit_Popup_v2.png "Click to see the full image.")
+
+
+After the user clicks on `SUBMIT VALIDATED DATA TO THE GDC`, the project state becomes "Harmonization Requested":
+
+[![GDC Submission Project State](images/GDC_Submission_Submit_Release_Project_State_v3.png)](images/GDC_Submission_Submit_Release_Project_State_v3.png "Click to see the full image.")
+
+The GDC requests that users submit their data to the GDC for harmonization within six months from the first upload of data to the project workspace.
+
+## Release
+Project release occurs after the data has been harmonized, and allows users to access this data with the [GDC Data Portal](https://portal.gdc.cancer.gov/) and other [GDC Data Access Tools](https://gdc.cancer.gov/access-data/data-access-processes-and-tools). The GDC will release data according to [GDC Data Sharing Policies](https://gdc.cancer.gov/submit-data/data-submission-policies). Data must be released within six months after GDC data processing has been completed, or the submitter may request earlier release using the "Request Release" function. A project can only be released once.
+
+[![GDC Submission Release Tab](images/GDC_Submission_Landing_Submitter_4.png)](images/GDC_Submission_Landing_Submitter_4.png "Click to see the full image.")
+
+When the user clicks on the action `REQUEST RELEASE`, the following Release popup is displayed:
+
+[![GDC Submission Release Popup](images/GDC_Submission_Submit_Release_Release_Popup.png)](images/GDC_Submission_Submit_Release_Release_Popup.png "Click to see the full image.")
+
+After the user clicks on `RELEASE SUBMITTED AND PROCESSED DATA`, the project release state becomes "Release Requested":
+
+[![GDC Submission Project State](images/GDC_Submission_Submit_Release_Project_State_3.png)](images/GDC_Submission_Submit_Release_Project_State_3.png "Click to see the full image.")
+
+
+>__Note__: Released cases and/or files can be redacted from the GDC. For more information, visit the [GDC Policies page (under GDC Data Sharing Policies)](https://gdc.cancer.gov/about-gdc/gdc-policies).
diff --git a/docs/Data_Submission_Portal/Users_Guide/Data_Submission_Walkthrough.md b/docs/Data_Submission_Portal/Users_Guide/Data_Submission_Walkthrough.md
new file mode 100644
index 000000000..86e069f93
--- /dev/null
+++ b/docs/Data_Submission_Portal/Users_Guide/Data_Submission_Walkthrough.md
@@ -0,0 +1,698 @@
+# Data Upload Walkthrough
+
+This guide details step-by-step procedures for different aspects of the GDC Data Submission process and how they relate to the GDC Data Model and structure. The first sections of this guide break down the submission process and associate each step with the Data Model. Additional sections are detailed below for strategies on expediting data submission, using features of the GDC Data Submission Portal, and best practices used by the GDC.
+
+## GDC Data Model Basics
+
+Pictured below is the submittable subset of the GDC Data Model: a roadmap for GDC data submission. Each oval node in the graphic represents an entity: a logical unit of data related to a specific clinical, biospecimen, or file facet in the GDC. An entity includes a set of fields, the associated values, and information about its related node associations. All submitted entities require a connection to another entity type, based on the GDC Data Model, and a `submitter_id` as an identifier. This walkthrough will go through the submission of different entities. The completed (submitted) portion of the entity process will be highlighted in __blue__.
+
+[![GDC Data Model 1](images/GDC-Data-Model-None.png)](images/GDC-Data-Model-None.png "Click to see the full image.")
+
+# Case Submission
+
+The `case` is the center of the GDC Data Model and usually describes a specific patient. Each `case` is connected to a `project`. Different types of clinical data, such as `diagnoses` and `exposures`, are connected to the `case` to describe the case's attributes and medical information.
+
+[![GDC Data Model 2](images/GDC-Data-Model-Case.png)](images/GDC-Data-Model-Case.png "Click to see the full image.")
+
+The main entity of the GDC Data Model is the `case`, each of which must be registered beforehand with [dbGaP](https://www.ncbi.nlm.nih.gov/sra/docs/submitdbgap) under a unique `submitter_id`. The first step to submitting a `case` is to consult the [Data Dictionary](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#data-dictionary-viewer), which details the fields that are associated with a `case`, the fields that are required to submit a `case`, and the values that can populate each field. Dictionary entries are available for all entities in the GDC Data Model.
+
+[![Dictionary Case](images/Dictionary_Case.png)](images/Dictionary_Case.png "Click to see the full image.")
+
+Submitting a [__Case__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=case) entity requires:
+
+* __`submitter_id`:__ A unique key to identify the `case`
+* __`projects.code`:__ A link to the `project`
+
+The submitter ID is different from the universally unique identifier (UUID), which is based on the [UUID Version 4 Naming Convention](https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_.28random.29). The UUID can be accessed under the `_id` field for each entity. For example, the `case` UUID can be accessed under the `case_id` field. The UUID is either assigned to each entity automatically or can be submitted by the user. Submitter-generated UUIDs cannot be uploaded in `submittable_data_file` entity types. See the [Data Model Users Guide](https://docs.gdc.cancer.gov/Data/Data_Model/GDC_Data_Model/#gdc-identifiers) for more details about GDC identifiers.
+
+The `projects.code` field connects the `case` entity to the `project` entity. The rest of the entity connections use the `submitter_id` field instead.
+
+The `case` entity can be added in JSON or TSV format. A template for any entity in either of these formats can be found in the Data Dictionary at the top of each page. Templates populated with `case` metadata in both formats are displayed below.
+
+```JSON
+{
+ "type": "case",
+ "submitter_id": "PROJECT-INTERNAL-000055",
+ "projects": {
+ "code": "INTERNAL"
+ }
+}
+```
+```TSV
+type	submitter_id	projects.code
+case	PROJECT-INTERNAL-000055	INTERNAL
+```
+
+>__Note:__ JSON and TSV formats handle links between entities (`case` and `project`) differently. JSON includes the `code` field nested within `projects` while TSV appends `code` to `projects` with a period.
+
+
+## Uploading the Case Submission File
+
+The file detailed above can be uploaded using either the GDC Data Submission Portal or the GDC API, as described below:
+
+### Upload Using the GDC Data Submission Portal
+
+An example of a `case` upload is detailed below. The [GDC Data Submission Portal](https://gdc.cancer.gov/submit-data/gdc-data-submission-portal) is equipped with a wizard window to facilitate the upload and validation of entities.
+
+#### 1. Upload Files
+
+Choosing _'UPLOAD'_ from the project dashboard will open the Upload Data Wizard.
+
+[![GDC Submission Wizard Upload Files](images/GDC_Submission_Wizard_Upload_2.png)](images/GDC_Submission_Wizard_Upload_2.png "Click to see the full image.")
+
+Files containing one or more entities can be added either by clicking on `CHOOSE FILE(S)` or using drag and drop. Files can be removed from the Upload Data Wizard by clicking on the garbage can icon that is displayed next to the file after the file is selected for upload.
+
+#### 2. Validate Entities
+
+The __Validate Entities__ stage acts as a safeguard against submitting incorrectly formatted data to the GDC Data Submission Portal. During the validation stage, the GDC API will validate the content of uploaded entities against the Data Dictionary to detect potential errors. Invalid entities will not be processed and must be corrected by the user and re-uploaded before being accepted. A validation error report provided by the system can be used to isolate and correct errors.
+
+When the first file is added, the wizard will move to the Validate section and the user can continue to add files. When all files have been added, choosing `VALIDATE` will run a test to check if the entities are valid for submission.
+
+[![GDC Submission Wizard Validate Files](images/GDC_Submission_Portal_Validate.png)](images/GDC_Submission_Portal_Validate.png "Click to see the full image.")
+
+#### 3. Commit or Discard Files
+If the upload contains valid entities, a new transaction will appear in the latest transactions panel with the option to `COMMIT` or `DISCARD` the data. Entities contained in these files can be committed (applied) to the project or discarded using these two buttons.
+
+If the upload contains invalid files, a transaction will appear with a FAILED status. Invalid files will need to be either corrected and re-uploaded or removed from the submission. If more than one file is uploaded and at least one is not valid, the validation step will fail for all files.
+
+[![Commit_Discard](images/GDC_Submission_CommitDiscard.png)](images/GDC_Submission_CommitDiscard.png "Click to see the full image.")
+
+
+### Upload Using the GDC API
+
+The API has a much broader range of functionality than the Data Wizard. Entities can be created, updated, and deleted through the API. See the [API Submission User Guide](https://docs.gdc.cancer.gov/API/Users_Guide/Submission/#creating-and-updating-entities) for a more detailed explanation and for the rest of the functionalities of the API. Generally, uploading an entity through the API can be performed using a command similar to the following:
+
+```Shell
+curl --header "X-Auth-Token: $token" --request POST --data @CASE.json "https://api.gdc.cancer.gov/v0/submission/GDC/INTERNAL/_dry_run?async=true"
+```
+CASE.json is detailed below.
+```json
+{
+ "type": "case",
+ "submitter_id": "PROJECT-INTERNAL-000055",
+ "projects": {
+ "code": "INTERNAL"
+ }
+}
+```
+
+In this example, the `_dry_run` marker is used to check whether the entities pass validation without committing any information. If a command succeeds with `_dry_run`, the same command will work when it is changed to `commit`. For more information, see [Dry Run Transactions](https://docs.gdc.cancer.gov/API/Users_Guide/Submission/#dry-run-transactions).
+
+>__Note:__ Submission of TSV files is also supported by the GDC API.
+
+Next, the file can either be committed (applied to the project) through the Data Submission Portal as before, or another API query can be performed that will commit the file to the project. The transaction number in the URL (467) is printed to the console during the first step of API submission and can also be retrieved from the [Transactions](Data_Submission_Process.md#transactions) tab in the Data Submission Portal.
+
+```Shell
+curl --header "X-Auth-Token: $token" --request POST "https://api.gdc.cancer.gov/v0/submission/GDC/INTERNAL/transactions/467/commit?async=true"
+```
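+
+The dry-run and commit requests above share the same URL structure, so the URLs can be assembled from the program, project, and transaction number. The following is a minimal sketch, assuming a POSIX shell; `dry_run_url` and `commit_url` are hypothetical helper names, and the `curl` calls in the comments require a valid token and network access.
+
+```shell
+# Base endpoint, as used in the commands above.
+GDC_SUBMISSION_BASE="https://api.gdc.cancer.gov/v0/submission"
+
+# Hypothetical helper: URL for a dry-run upload to <program>/<project>.
+dry_run_url() {
+  printf '%s/%s/%s/_dry_run?async=true\n' "$GDC_SUBMISSION_BASE" "$1" "$2"
+}
+
+# Hypothetical helper: URL that commits dry-run transaction <id>.
+commit_url() {
+  printf '%s/%s/%s/transactions/%s/commit?async=true\n' "$GDC_SUBMISSION_BASE" "$1" "$2" "$3"
+}
+
+# Usage (not executed here):
+# curl --header "X-Auth-Token: $token" --request POST --data @CASE.json "$(dry_run_url GDC INTERNAL)"
+# curl --header "X-Auth-Token: $token" --request POST "$(commit_url GDC INTERNAL 467)"
+```
+
+Quoting the URL keeps the shell from treating the `?` in the query string as a wildcard.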
+
+# Clinical Data Submission
+
+Typically, a submission project will include additional information about a `case` such as `demographic`, `diagnosis`, or `exposure` data.
+
+## Clinical Data Requirements
+
+For the GDC to release a project, a minimum set of clinical properties is required. Minimal GDC requirements for each project include age, gender, and diagnosis information. Other [requirements](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-entity-list&anchor=clinical) may be added when the submitter is approved for submission to the GDC.
+
+[![GDC Data Model Clinical](images/GDC-Data-Model-Clinical.png)](images/GDC-Data-Model-Clinical.png "Click to see the full image.")
+
+## Submitting a Demographic Entity to a Case
+
+The `demographic` entity contains information that characterizes the `case` entity.
+
+Submitting a [__Demographic__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=demographic) entity requires:
+
+* __`submitter_id`:__ A unique key to identify the `demographic` entity.
+* __`cases.submitter_id`:__ The unique key that was used for the `case` that links the `demographic` entity to the `case`.
+* __`ethnicity`:__ An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.
+* __`gender`:__ Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal roles.
+* __`race`:__ An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.
+
+```JSON
+{
+ "type": "demographic",
+ "submitter_id": "PROJECT-INTERNAL-000055-DEMOGRAPHIC-1",
+ "cases": {
+ "submitter_id": "PROJECT-INTERNAL-000055"
+ },
+ "ethnicity": "not hispanic or latino",
+ "gender": "male",
+ "race": "asian"
+}
+```
+```TSV
+type	submitter_id	cases.submitter_id	ethnicity	gender	race
+demographic	PROJECT-INTERNAL-000055-DEMOGRAPHIC-1	PROJECT-INTERNAL-000055	not hispanic or latino	male	asian
+```
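+
+Before uploading a TSV file, it can help to confirm that its header row contains every required field listed above. The following is a minimal sketch, assuming a POSIX shell and a tab-delimited file; `missing_columns` is a hypothetical helper and is not part of any GDC tool.
+
+```shell
+# Hypothetical helper: prints each required column that is absent from the
+# header (first line) of a tab-delimited file.
+# Usage: missing_columns FILE.tsv column1 column2 ...
+missing_columns() {
+  file="$1"
+  shift
+  tab="$(printf '\t')"
+  header="$(head -n 1 "$file")"
+  for col in "$@"; do
+    # Wrap the header in tabs so that only whole column names match.
+    case "$tab$header$tab" in
+      *"$tab$col$tab"*) ;;
+      *) printf '%s\n' "$col" ;;
+    esac
+  done
+}
+```
+
+Running `missing_columns demographic.tsv submitter_id cases.submitter_id ethnicity gender race` prints nothing when all five required fields are present (`demographic.tsv` is a placeholder file name).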
+
+## Submitting a Diagnosis Entity to a Case
+
+Submitting a [__Diagnosis__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=diagnosis) entity requires:
+
+* __`submitter_id`:__ A unique key to identify the `diagnosis` entity.
+* __`cases.submitter_id`:__ The unique key that was used for the `case` that links the `diagnosis` entity to the `case`.
+* __`age_at_diagnosis`:__ Age at the time of diagnosis expressed in number of days since birth.
+* __`days_to_last_follow_up`:__ Time interval from the date of last follow up to the date of initial pathologic diagnosis, represented as a calculated number of days.
+* __`days_to_last_known_disease_status`:__ Time interval from the date of last follow up to the date of initial pathologic diagnosis, represented as a calculated number of days.
+* __`days_to_recurrence`:__ Time interval from the date of new tumor event including progression, recurrence and new primary malignancies to the date of initial pathologic diagnosis, represented as a calculated number of days.
+* __`last_known_disease_status`:__ The state or condition of an individual's neoplasm at a particular point in time.
+* __`morphology`:__ The third edition of the International Classification of Diseases for Oncology, published in 2000 used principally in tumor and cancer registries for coding the site (topography) and the histology (morphology) of neoplasms. The study of the structure of the cells and their arrangement to constitute tissues and, finally, the association among these to form organs. In pathology, the microscopic process of identifying normal and abnormal morphologic characteristics in tissues, by employing various cytochemical and immunocytochemical stains. A system of numbered categories for representation of data.
+* __`primary_diagnosis`:__ Text term for the structural pattern of cancer cells used to define a microscopic diagnosis.
+* __`progression_or_recurrence`:__ Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment.
+* __`site_of_resection_or_biopsy`:__ The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries for coding the site (topography) and the histology (morphology) of neoplasms. The description of an anatomical region or of a body part. Named locations of, or within, the body. A system of numbered categories for representation of data.
+* __`tissue_or_organ_of_origin`:__ Text term that describes the anatomic site of the tumor or disease.
+* __`tumor_grade`:__ Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.
+* __`tumor_stage`:__ The extent of a cancer in the body. Staging is usually based on the size of the tumor, whether lymph nodes contain cancer, and whether the cancer has spread from the original site to other parts of the body. The accepted values for tumor_stage depend on the tumor site, type, and accepted staging system. These items should accompany the tumor_stage value as associated metadata.
+* __`vital_status`:__ The survival state of the person registered on the protocol.
+
+```JSON
+{
+ "type": "diagnosis",
+ "submitter_id": "PROJECT-INTERNAL-000055-DIAGNOSIS-1",
+ "cases": {
+ "submitter_id": "GDC-INTERNAL-000099"
+ },
+ "age_at_diagnosis": 10256,
+ "days_to_last_follow_up": 34,
+ "days_to_last_known_disease_status": 34,
+ "days_to_recurrence": 45,
+ "last_known_disease_status": "Tumor free",
+ "morphology": "8260/3",
+ "primary_diagnosis": "ACTH-producing tumor",
+ "progression_or_recurrence": "no",
+ "site_of_resection_or_biopsy": "Lung, NOS",
+ "tissue_or_organ_of_origin": "Lung, NOS",
+ "tumor_grade": "not reported",
+ "tumor_stage": "stage i",
+ "vital_status": "alive"
+}
+```
+```TSV
+type submitter_id cases.submitter_id age_at_diagnosis days_to_last_follow_up days_to_last_known_disease_status days_to_recurrence last_known_disease_status morphology primary_diagnosis progression_or_recurrence site_of_resection_or_biopsy tissue_or_organ_of_origin tumor_grade tumor_stage vital_status
+diagnosis PROJECT-INTERNAL-000055-DIAGNOSIS-1 GDC-INTERNAL-000099 10256 34 34 45 Tumor free 8260/3 ACTH-producing tumor no Lung, NOS Lung, NOS not reported stage i alive
+```
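
Several diagnosis fields are day counts rather than dates. The sketch below shows one way such values could be derived from calendar dates before submission. The dates are invented for illustration, the helper `days_between` is hypothetical, and the sign convention (positive when the event follows diagnosis) is an assumption based on the positive example values above.

```python
from datetime import date


def days_between(earlier, later):
    """Number of whole days from `earlier` to `later`."""
    return (later - earlier).days


# Hypothetical dates, chosen only for illustration.
birth_date = date(1980, 1, 15)
diagnosis_date = date(2008, 2, 13)
last_follow_up_date = date(2008, 3, 18)

diagnosis_fields = {
    # Age at diagnosis, expressed in days since birth.
    "age_at_diagnosis": days_between(birth_date, diagnosis_date),
    # Interval between diagnosis and last follow-up, in days.
    "days_to_last_follow_up": days_between(diagnosis_date, last_follow_up_date),
}
print(diagnosis_fields)
```

Only the computed integers would be placed in the `diagnosis` entity; the source dates themselves are not fields listed in this section.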
+
+## Submitting an Exposure Entity to a Case
+
+Submitting an [__Exposure__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=exposure) entity does not require any information besides a link to the `case` and a `submitter_id`. The following fields are optionally included:
+
+* __`alcohol_history`:__ A response to a question that asks whether the participant has consumed at least 12 drinks of any kind of alcoholic beverage in their lifetime.
+* __`alcohol_intensity`:__ Category to describe the patient's current level of alcohol use as self-reported by the patient.
+* __`alcohol_days_per_week`:__ Numeric value used to describe the average number of days each week that a person consumes an alcoholic beverage.
+* __`years_smoked`:__ Numeric value (or unknown) to represent the number of years a person has been smoking.
+* __`tobacco_smoking_onset_year`:__ The year in which the participant began smoking.
+* __`tobacco_smoking_quit_year`:__ The year in which the participant quit smoking.
+
+```JSON
+{
+ "type": "exposure",
+ "submitter_id": "PROJECT-INTERNAL-000055-EXPOSURE-1",
+ "cases": {
+ "submitter_id": "PROJECT-INTERNAL-000055"
+ },
+ "alcohol_history": "yes",
+ "alcohol_intensity": "Drinker",
+ "alcohol_days_per_week": 2,
+ "years_smoked": 5,
+ "tobacco_smoking_onset_year": 2007,
+ "tobacco_smoking_quit_year": 2012
+}
+```
+```TSV
+type submitter_id cases.submitter_id alcohol_history alcohol_intensity alcohol_days_per_week years_smoked tobacco_smoking_onset_year tobacco_smoking_quit_year
+exposure PROJECT-INTERNAL-000055-EXPOSURE-1 PROJECT-INTERNAL-000055 yes Drinker 2 5 2007 2012
+```
+
+>__Note:__ Submitting a clinical entity uses the same conventions as submitting a `case` entity (detailed above).
+
+
+# Biospecimen Submission
+
+One of the main features of the GDC is the genomic data harmonization workflow. Genomic data is connected to the case through biospecimen entities. The `sample` entity describes a biological piece of matter that originated from a `case`. Subsets of the `sample` such as `portions` and `analytes` can optionally be described. The `aliquot` originates from a `sample` or `analyte` and describes the nucleic acid extract that was sequenced. The `read_group` entity describes the resulting set of reads from one sequencing lane.
+
+## Sample Submission
+
+[![GDC Data Model 3](images/GDC-Data-Model-Sample.png)](images/GDC-Data-Model-Sample.png "Click to see the full image.")
+
+A `sample` submission has the same general structure as a `case` submission, as it requires a unique key and a link to the `case`. However, `sample` entities require additional values: `sample_type` and `tissue_type`. This data is required because it is necessary for the data to be interpreted. For example, an investigator using this data would need to know whether the `sample` came from tumor or normal tissue.
+
+[![Dictionary Sample](images/Dictionary_Sample.png)](images/Dictionary_Sample.png "Click to see the full image.")
+
+Submitting a [__Sample__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=sample) entity requires:
+
+* __`submitter_id`:__ A unique key to identify the `sample`.
+* __`cases.submitter_id`:__ The unique key that was used for the `case` that links the `sample` to the `case`.
+* __`sample_type`:__ Type of the `sample`. Named for its cellular source, molecular composition, and/or therapeutic treatment.
+* __`tissue_type`:__ Text term that represents a description of the kind of tissue collected with respect to disease status or proximity to tumor tissue.
+
+>__Note:__ The `case` must be "committed" to the project before a `sample` can be linked to it. This also applies to all other links between entities.
+
+```JSON
+{
+ "type": "sample",
+ "cases": {
+ "submitter_id": "PROJECT-INTERNAL-000055"
+ },
+ "sample_type": "Blood Derived Normal",
+ "submitter_id": "Blood-00001SAMPLE_55",
+ "tissue_type": "Normal"
+}
+```
+```TSV
+type cases.submitter_id submitter_id sample_type tissue_type
+sample PROJECT-INTERNAL-000055 Blood-00001SAMPLE_55 Blood Derived Normal Normal
+```
+
+## Portion, Analyte and Aliquot Submission
+
+[![GDC Data Model 4](images/GDC-Data-Model-Aliquot.png)](images/GDC-Data-Model-Aliquot.png "Click to see the full image.")
+
+Submitting a [__Portion__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=portion) entity requires:
+
+* __`submitter_id`:__ A unique key to identify the `portion`.
+* __`samples.submitter_id`:__ The unique key that was used for the `sample` that links the `portion` to the `sample`.
+
+```JSON
+{
+ "type": "portion",
+ "submitter_id": "Blood-portion-000055",
+ "samples": {
+ "submitter_id": "Blood-00001SAMPLE_55"
+ }
+}
+
+```
+```TSV
+type submitter_id samples.submitter_id
+portion Blood-portion-000055 Blood-00001SAMPLE_55
+```
+
+Submitting an [__Analyte__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=analyte) entity requires:
+
+* __`submitter_id`:__ A unique key to identify the `analyte`.
+* __`portions.submitter_id`:__ The unique key that was used for the `portion` that links the `analyte` to the `portion`.
+* __`analyte_type`:__ Text term that represents the kind of molecular specimen analyte.
+
+```JSON
+{
+ "type": "analyte",
+ "portions": {
+ "submitter_id": "Blood-portion-000055"
+ },
+ "analyte_type": "DNA",
+ "submitter_id": "Blood-analyte-000055"
+}
+
+```
+```TSV
+type portions.submitter_id analyte_type submitter_id
+analyte Blood-portion-000055 DNA Blood-analyte-000055
+```
+
+Submitting an [__Aliquot__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=aliquot) entity requires:
+
+* __`submitter_id`:__ A unique key to identify the `aliquot`.
+* __`analytes.submitter_id`:__ The unique key that was used for the `analyte` that links the `aliquot` to the `analyte`.
+
+```JSON
+{
+ "type": "aliquot",
+ "submitter_id": "Blood-00021-aliquot55",
+ "analytes": {
+ "submitter_id": "Blood-analyte-000055"
+ }
+}
+
+```
+```TSV
+type submitter_id analytes.submitter_id
+aliquot Blood-00021-aliquot55 Blood-analyte-000055
+```
+
+>__Note:__ `aliquot` entities can be directly linked to `sample` entities via the `samples.submitter_id`. The `portion` and `analyte` entities are not required for submission.
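
As a sketch of the note above, an `aliquot` that skips the `portion` and `analyte` levels could be built with a `samples` link instead of an `analytes` link. The helper `link_to_parent` is hypothetical and only assembles the JSON structure; it does not contact the GDC.

```python
import json


def link_to_parent(entity_type, submitter_id, parent_link, parent_submitter_id, **fields):
    """Build an entity dict whose parent is referenced by its submitter_id."""
    entity = {"type": entity_type, "submitter_id": submitter_id}
    entity[parent_link] = {"submitter_id": parent_submitter_id}
    entity.update(fields)
    return entity


# Aliquot linked directly to the sample from the earlier example.
aliquot = link_to_parent(
    "aliquot", "Blood-00021-aliquot55", "samples", "Blood-00001SAMPLE_55"
)
print(json.dumps(aliquot, indent=2))
```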
+
+## Read Group Submission
+
+[![GDC Data Model 5](images/GDC-Data-Model-RG.png)](images/GDC-Data-Model-RG.png "Click to see the full image.")
+
+Information about sequencing reads is necessary for downstream analysis, thus the `read_group` entity requires more fields than the other Biospecimen entities (`sample`, `portion`, `analyte`, `aliquot`).
+
+Submitting a [__Read Group__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=read_group) entity requires:
+
+* __`submitter_id`:__ A unique key to identify the `read_group`.
+* __`aliquots.submitter_id`:__ The unique key that was used for the `aliquot` that links the `read_group` to the `aliquot`.
+* __`experiment_name`:__ Submitter-defined name for the experiment.
+* __`is_paired_end`:__ Are the reads paired end? (Boolean value: `true` or `false`).
+* __`library_name`:__ Name of the library.
+* __`library_strategy`:__ Library strategy.
+* __`platform`:__ Name of the platform used to obtain data.
+* __`read_group_name`:__ The name of the `read_group`.
+* __`read_length`:__ The length of the reads (integer).
+* __`sequencing_center`:__ Name of the center that provided the sequence files.
+* __`library_selection`:__ Library Selection Method.
+* __`target_capture_kit`:__ Description that can uniquely identify a target capture kit. Suggested value is a combination of vendor, kit name, and kit version.
+
+```JSON
+{
+ "type": "read_group",
+ "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55",
+ "experiment_name": "Resequencing",
+ "is_paired_end": true,
+ "library_name": "Solexa-34688",
+ "library_strategy": "WXS",
+ "platform": "Illumina",
+ "read_group_name": "205DD.3-2",
+ "read_length": 75,
+ "sequencing_center": "BI",
+ "library_selection": "Hybrid Selection",
+ "target_capture_kit": "Custom MSK IMPACT Panel - 468 Genes",
+ "aliquots":
+ {
+ "submitter_id": "Blood-00021-aliquot55"
+ }
+}
+
+```
+```TSV
+type submitter_id experiment_name is_paired_end library_name library_selection library_strategy platform read_group_name read_length sequencing_center target_capture_kit aliquots.submitter_id
+read_group Blood-00001-aliquot_lane1_barcodeACGTAC_55 Resequencing true Solexa-34688 Hybrid Selection WXS Illumina 205DD.3-2 75 BI Custom MSK IMPACT Panel - 468 Genes Blood-00021-aliquot55
+```
+
+>__Note:__ Submitting a biospecimen entity uses the same conventions as submitting a `case` entity (detailed above).
+
+# Experiment Data Submission
+
+Several types of experiment data can be uploaded to the GDC. The `submitted_aligned_reads` and `submitted_unaligned_reads` files are associated with the `read_group` entity, while the array-based files such as the `submitted_tangent_copy_number` are associated with the `aliquot` entity. Each of these file types is described in its respective entity submission and is uploaded separately using the [GDC API](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/) or the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool).
+
+[![GDC Data Model 6](images/GDC-Data-Model-Reads.png)](images/GDC-Data-Model-Reads.png "Click to see the full image.")
+
+Before the experiment data file can be submitted, the GDC requires that the user provide information about the file as a `submittable_data_file` entity. This includes file-specific data needed to validate the file and assess which analyses should be performed. Sequencing data files can be submitted as `submitted_aligned_reads` or `submitted_unaligned_reads`.
+
+Submitting a [__Submitted Aligned-Reads__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=submitted_aligned_reads) ([__Submitted Unaligned-Reads__](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=submitted_unaligned_reads)) entity requires:
+
+* __`submitter_id`:__ A unique key to identify the `submitted_aligned_reads`.
+* __`read_groups.submitter_id`:__ The unique key that was used for the `read_group` that links the `submitted_aligned_reads` to the `read_group`.
+* __`data_category`:__ Broad categorization of the contents of the data file.
+* __`data_format`:__ Format of the data files.
+* __`data_type`:__ Specific content type of the data file (must be "Aligned Reads").
+* __`experimental_strategy`:__ The sequencing strategy used to generate the data file.
+* __`file_name`:__ The name (or part of a name) of a file (of any type).
+* __`file_size`:__ The size of the data file (object) in bytes.
+* __`md5sum`:__ The 128-bit hash value expressed as a 32-digit hexadecimal number used as a file's digital fingerprint.
+
+
+```JSON
+{
+ "type": "submitted_aligned_reads",
+ "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55.bam",
+ "data_category": "Raw Sequencing Data",
+ "data_format": "BAM",
+ "data_type": "Aligned Reads",
+ "experimental_strategy": "WGS",
+ "file_name": "test.bam",
+ "file_size": 38,
+ "md5sum": "aa6e82d11ccd8452f813a15a6d84faf1",
+ "read_groups": [
+ {
+ "submitter_id": "Primary_Tumor_RG_86-1"
+ }
+ ]
+}
+```
+```TSV
+type submitter_id data_category data_format data_type experimental_strategy file_name file_size md5sum read_groups.submitter_id#1
+submitted_aligned_reads Blood-00001-aliquot_lane1_barcodeACGTAC_55.bam Raw Sequencing Data BAM Aligned Reads WGS test.bam 38 aa6e82d11ccd8452f813a15a6d84faf1 Primary_Tumor_RG_86-1
+```
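
The `file_name`, `file_size`, and `md5sum` values must match the file that will later be uploaded. The following sketch shows one way to compute them locally; the helper `describe_file` is hypothetical, and a small temporary file stands in for a real BAM.

```python
import hashlib
import os
import tempfile


def describe_file(path, chunk_size=1024 * 1024):
    """Return file_name, file_size (bytes), and md5sum for the file at `path`."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large sequencing files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return {
        "file_name": os.path.basename(path),
        "file_size": os.path.getsize(path),
        "md5sum": md5.hexdigest(),
    }


# Demonstration on a small temporary file standing in for a real BAM.
with tempfile.NamedTemporaryFile(suffix=".bam", delete=False) as tmp:
    tmp.write(b"placeholder bytes, not a real BAM file")
    demo_path = tmp.name

file_fields = describe_file(demo_path)
os.remove(demo_path)
print(file_fields)
```

The resulting dictionary can be merged into the `submitted_aligned_reads` entity alongside the remaining required fields.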
+
+>__Note:__ For details on submitting experiment data associated with more than one `read_group` entity, see the [Tips for Complex Submissions](#submitting-complex-data-model-relationships) section.
+
+## Uploading the Submittable Data File to the GDC
+
+The submittable data file can be uploaded once it is registered with the GDC. A submittable data file is registered when its corresponding entity (e.g. `submitted_unaligned_reads`) is uploaded and committed. It is important to note that the Harmonization process does not occur on these submitted files until the user clicks the [`Request Submission`](Data_Submission_Process.md#release) button. Uploading the file can be performed with either the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool) or the [GDC API](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/). Other types of data files such as clinical supplements, biospecimen supplements, and pathology reports are uploaded to the GDC in the same way. Supported data file formats are listed at the GDC [Submitted Data Types and File Formats](https://gdc.cancer.gov/about-data/data-types-and-file-formats/submitted-data-types-and-file-formats) website.
+
+__GDC Data Transfer Tool:__ A file can be uploaded using its UUID (which can be retrieved from the GDC Submission Portal or API) once it is registered.
+
+[![UUID Location](images/GDC_Submission_UUID_location.png)](images/GDC_Submission_UUID_location.png "Click to see the full image.")
+
+The following command can be used to upload the file:
+
+```Shell
+gdc-client upload --project-id PROJECT-INTERNAL --identifier a053fad1-adc9-4f2d-8632-923579128985 -t $token -f $path_to_file
+```
+
+Additionally a manifest can be downloaded from the Submission Portal and passed to the Data Transfer Tool. This will allow for the upload of more than one `submittable_data_file`:
+
+```Shell
+gdc-client upload -m manifest.yml -t $token
+```
+
+__API Upload:__ A `submittable_data_file` can be uploaded through the API by using the `/submission/$PROGRAM/$PROJECT/files` endpoint. The following command would typically be used to upload a file:
+
+```Shell
+curl --request PUT --header "X-Auth-Token: $token" https://api.gdc.cancer.gov/v0/submission/PROJECT/INTERNAL/files/6d45f2a0-8161-42e3-97e6-e058ac18f3f3 --data-binary @$path_to_file
+```
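
For scripted uploads, the same request could be prepared from Python. This is a minimal sketch using only the standard library; it builds the `PUT` request but deliberately does not send it. The helper `build_upload_request` and the placeholder token are assumptions for illustration, and the URL layout is copied from the `curl` example above.

```python
import urllib.request

API_ROOT = "https://api.gdc.cancer.gov/v0/submission"


def build_upload_request(program, project, uuid, payload, token):
    """Prepare (but do not send) a PUT request carrying the file bytes."""
    url = f"{API_ROOT}/{program}/{project}/files/{uuid}"
    return urllib.request.Request(
        url, data=payload, method="PUT", headers={"X-Auth-Token": token}
    )


request = build_upload_request(
    "PROJECT",
    "INTERNAL",
    "6d45f2a0-8161-42e3-97e6-e058ac18f3f3",
    payload=b"file bytes go here",  # in practice, the contents of the data file
    token="PLACEHOLDER-TOKEN",      # replace with a real authentication token
)
print(request.get_method(), request.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

For large files the Data Transfer Tool is likely the more practical choice, since this sketch holds the whole payload in memory.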
+
+For more details on how to upload a `submittable_data_file` to a project see the [API Users Guide](https://docs.gdc.cancer.gov/API/Users_Guide/Submission/) and the [Data Transfer Tool Users Guide](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/).
+
+## Annotation Submission
+
+The GDC Data Portal supports the use of annotations for any submitted entity or file. An annotation entity may include comments about why particular patients or samples are not present or why they may exhibit critical differences from others. Annotations include information that cannot be submitted to the GDC through other existing nodes or properties.
+
+To create an annotation, please contact the GDC Support Team (support@nci-gdc.datacommons.io).
+
+## Deleting Submitted Entities
+
+The GDC Data Submission Portal allows users to delete submitted entities from the project when the project is in an "OPEN" state. Files cannot be deleted while in the "SUBMITTED" state. This section applies to entities that have been committed to the project. Entities that have not been committed can be removed from the project by choosing the `DISCARD` button. Entities can also be deleted using the API. See the [API Submission Documentation](https://docs.gdc.cancer.gov/API/Users_Guide/Submission/#deleting-entities) for specific instructions.
+
+>__NOTE:__ Entities associated with files uploaded to the GDC object store cannot be deleted until the associated file has been deleted. Users must utilize the [GDC Data Transfer Tool](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/#deleting-previously-uploaded-data) to delete these files first.
+
+### Simple Deletion
+
+If an entity was uploaded and has no related entities, it can be deleted from the [Browse](Data_Submission_Process.md#browse) tab. Once the entity to be deleted is selected, choose the `DELETE` button in the right panel under "ACTIONS".
+
+
+[![GDC Delete Unassociated Case](images/GDC-Delete-Case-Unassociated.png)](images/GDC-Delete-Case-Unassociated.png "Click to see the full image.")
+
+
+A message will then appear asking if you are sure about deleting the entity. Choosing the `YES, DELETE` button will remove the entity from the project, whereas choosing the `NO, CANCEL` button will return the user to the previous screen.
+
+
+[![GDC Yes or No](images/GDC-Delete-Sure.png)](images/GDC-Delete-Sure.png "Click to see the full image.")
+
+
+### Deletion with Dependents
+
+If an entity has related entities, such as a `case` with multiple `samples` and `aliquots`, deletion takes one extra step.
+
+
+[![GDC Delete Associated Case](images/GDC-Delete-Case-Associated.png)](images/GDC-Delete-Case-Associated.png "Click to see the full image.")
+
+
+Follow the [Simple Deletion](Data_Submission_Walkthrough.md#simple-deletion) method until the end. This action will appear in the [Transactions](Data_Submission_Process.md#transactions) tab as "Delete" with a "FAILED" state.
+
+
+[![GDC Delete Failed](images/GDC-Failed-Transaction.png)](images/GDC-Failed-Transaction.png "Click to see the full image.")
+
+
+Choose the failed transaction and the right panel will show the list of entities related to the entity that was going to be deleted.
+
+
+[![GDC Error Related](images/GDC-Error-Related.png)](images/GDC-Error-Related.png "Click to see the full image.")
+
+
+Selecting the `DELETE ALL` button at the bottom of the list will delete all of the related entities, their descendants, and the original entity.
+
+
+### Submitted Data File Deletion
+
+The [`submittable_data_files`](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-entity-list&anchor=submittable_data_file) that were uploaded erroneously are deleted separately from their associated entity using the GDC Data Transfer Tool. See the section on [Deleting Data Files](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/#deleting-previously-uploaded-data) in the Data Transfer Tool users guide for specific instructions.
+
+## Updating Uploaded Entities
+
+Before harmonization occurs, entities can be modified to update, add, or delete information. These methods are outlined below.
+
+### Updating or Adding Fields
+
+Updated or additional fields can be applied to entities by re-uploading them through the GDC Data Submission portal or API. See below for an example of a case upload with a `primary_site` field being added and a `disease_type` field being updated.
+
+```Before
+{
+"type":"case",
+"submitter_id":"GDC-INTERNAL-000043",
+"projects":{
+ "code":"INTERNAL"
+},
+"disease_type": "Myomatous Neoplasms"
+}
+```
+```After
+{
+"type":"case",
+"submitter_id":"GDC-INTERNAL-000043",
+"projects":{
+ "code":"INTERNAL"
+},
+"disease_type": "Myxomatous Neoplasms",
+"primary_site": "Pancreas"
+}
+```
+__Guidelines:__
+
+* The newly uploaded entity must contain the `submitter_id` of the existing entity so that the system updates the correct one.
+* All newly updated entities will be validated by the GDC Dictionary. All required fields must be present in the newly updated entity.
+* Fields that are not required do not need to be re-uploaded and will remain unchanged in the entity unless they are updated.
+
+### Deleting Optional Fields
+
+It may be necessary to delete fields from uploaded entities. This can be performed through the API and can only be applied to optional fields. It also requires the UUID of the entity, which can be retrieved from the submission portal or with a GraphQL query.
+
+In the example below, the `primary_site` and `disease_type` fields are removed from a `case` entity:
+
+```Shell
+curl --header "X-Auth-Token: $token_string" --request DELETE --header "Content-Type: application/json" "https://api.gdc.cancer.gov/v0/submission/EXAMPLE/PROJECT/entities/7aab7578-34ff-5651-89bb-57aefdc4c4f8?fields=primary_site,disease_type"
+```
+
+```Before
+{
+"type":"case",
+"submitter_id":"GDC-INTERNAL-000043",
+"projects":{
+ "code":"INTERNAL"
+},
+"disease_type": "Germ Cell Neoplasms",
+"primary_site": "Pancreas"
+}
+```
+```After
+{
+"type":"case",
+"submitter_id":"GDC-INTERNAL-000043",
+"projects":{
+ "code":"INTERNAL"
+}
+}
+```
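
The deletion URL above follows a regular pattern: the entity UUID is the last path segment and the fields to remove are passed as a comma-separated `fields` query parameter. A small sketch of assembling that URL is shown below; the helper `field_deletion_url` is hypothetical and the pattern is inferred only from the `curl` example in this section.

```python
import urllib.parse

API_ROOT = "https://api.gdc.cancer.gov/v0/submission"


def field_deletion_url(program, project, entity_uuid, fields):
    """Build the URL used with a DELETE request to remove optional fields."""
    # safe="," keeps the commas between field names unescaped.
    query = urllib.parse.urlencode({"fields": ",".join(fields)}, safe=",")
    return f"{API_ROOT}/{program}/{project}/entities/{entity_uuid}?{query}"


url = field_deletion_url(
    "EXAMPLE",
    "PROJECT",
    "7aab7578-34ff-5651-89bb-57aefdc4c4f8",
    ["primary_site", "disease_type"],
)
print(url)
```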
+
+### Versioning
+Changes to entities will create versions. For more information on this, please go to [Uploading New Versions of Data Files](https://docs.gdc.cancer.gov/API/Users_Guide/Submission/#uploading-new-versions-of-data-files).
+
+## Strategies for Submitting in Bulk
+
+Each submission in the previous sections was broken down by component to demonstrate the GDC Data Model structure. However, the submission of multiple entities at once is supported and encouraged. Two strategies for submitting data efficiently are discussed here.
+
+### Registering a BAM File: One Step
+
+Registering a BAM file (or any other type) can be performed in one step by including all of the entities, from `case` to `submitted_aligned_reads`, in one file. See the example below:
+
+```JSON
+[{
+ "type": "case",
+ "submitter_id": "PROJECT-INTERNAL-000055",
+ "projects": {
+ "code": "INTERNAL"
+ }
+},
+{
+ "type": "sample",
+ "cases": {
+ "submitter_id": "PROJECT-INTERNAL-000055"
+ },
+ "sample_type": "Blood Derived Normal",
+ "submitter_id": "Blood-00001_55"
+},
+{
+ "type": "portion",
+ "submitter_id": "Blood-portion-000055",
+ "samples": {
+ "submitter_id": "Blood-00001_55"
+ }
+},
+{
+ "type": "analyte",
+ "portions": {
+ "submitter_id": "Blood-portion-000055"
+ },
+ "analyte_type": "DNA",
+ "submitter_id": "Blood-analyte-000055"
+},
+{
+ "type": "aliquot",
+ "submitter_id": "Blood-00021-aliquot55",
+ "analytes": {
+ "submitter_id": "Blood-analyte-000055"
+ }
+},
+{
+ "type": "read_group",
+ "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55",
+ "experiment_name": "Resequencing",
+ "is_paired_end": true,
+ "library_name": "Solexa-34688",
+ "library_selection":"Hybrid Selection",
+ "library_strategy": "WXS",
+ "platform": "Illumina",
+ "read_group_name": "205DD.3-2",
+ "read_length": 75,
+ "sequencing_center": "BI",
+ "aliquots":
+ {
+ "submitter_id": "Blood-00021-aliquot55"
+ }
+},
+{
+ "type": "submitted_aligned_reads",
+ "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55.bam",
+ "data_category": "Raw Sequencing Data",
+ "data_format": "BAM",
+ "data_type": "Aligned Reads",
+ "experimental_strategy": "WGS",
+ "file_name": "test.bam",
+ "file_size": 38,
+ "md5sum": "aa6e82d11ccd8452f813a15a6d84faf1",
+ "read_groups": [
+ {
+ "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55"
+ }
+ ]
+}]
+```
+
+All of the entities are placed into a JSON list object:
+
+`[{"type": "case","submitter_id": "PROJECT-INTERNAL-000055","projects": {"code": "INTERNAL"}}, entity-2, entity-3]`
+
+The entities need not be in any particular order as they are validated together.
+
+>__Note:__ Tab-delimited format is not recommended for 'one-step' submissions due to an inability of the format to accommodate multiple 'types' in one row.
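
Because a one-step file contains every level of the hierarchy, a mistyped `submitter_id` in a link is easy to miss. The following sketch is a local pre-check, not GDC validation: it reports links whose target is absent from the same batch. The helper `unresolved_links` is hypothetical, and a reported link is not necessarily an error, since the target may already be committed to the project.

```python
def unresolved_links(entities):
    """List (entity, link name, target) for links not satisfied within the batch."""
    known_ids = {entity["submitter_id"] for entity in entities}
    missing = []
    for entity in entities:
        for key, value in entity.items():
            # Links may be a single object or a list of objects.
            targets = value if isinstance(value, list) else [value]
            for target in targets:
                if isinstance(target, dict) and "submitter_id" in target:
                    if target["submitter_id"] not in known_ids:
                        missing.append(
                            (entity["submitter_id"], key, target["submitter_id"])
                        )
    return missing


batch = [
    {"type": "case", "submitter_id": "PROJECT-INTERNAL-000055",
     "projects": {"code": "INTERNAL"}},
    {"type": "sample", "submitter_id": "Blood-00001_55",
     "sample_type": "Blood Derived Normal",
     "cases": {"submitter_id": "PROJECT-INTERNAL-000055"}},
    # Deliberate mismatch: this sample ID does not appear in the batch.
    {"type": "aliquot", "submitter_id": "Blood-00021-aliquot55",
     "samples": {"submitter_id": "Blood-00001SAMPLE_55"}},
]

problems = unresolved_links(batch)
print(problems)
```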
+
+### Submitting Numerous Cases
+
+The GDC understands that submitters will have projects that comprise more entities than would be reasonable to individually parse into JSON formatted files. Additionally, many investigators store large amounts of data in a tab-delimited format (TSV). For instances like this, we recommend parsing all entities of the same type into separate TSVs and submitting them on a type-basis.
+
+For example, a user may want to submit 100 `cases` associated with 100 `samples`, 100 `portions`, 100 `analytes`, 100 `aliquots`, and 100 `read_groups`. Constructing and submitting 100 JSON files would be tedious and difficult to organize. The solution is to submit one `case` TSV containing the 100 `cases`, one `sample` TSV containing the 100 `samples`, and so on. Doing this would only require six TSVs, and these files can be formatted in programs such as Microsoft Excel or Google Sheets.
+
+See the following example TSV files:
+
+* [Cases.tsv](Cases.tsv)
+* [Samples.tsv](Samples.tsv)
+* [Portions.tsv](Portions.tsv)
+* [Analytes.tsv](Analytes.tsv)
+* [Aliquots.tsv](Aliquots.tsv)
+* [Read-Groups.tsv](Readgroups.tsv)
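
When the metadata already exists in a script or database, the per-type TSVs can also be generated programmatically. The sketch below groups entities by `type` and writes one tab-delimited table per type. The helpers `flatten` and `entities_to_tsv` are hypothetical, and the sketch assumes each link is a single nested object rather than a list.

```python
import csv
import io
from collections import defaultdict


def flatten(entity):
    """Turn nested links into dotted columns, e.g. cases.submitter_id."""
    row = {}
    for key, value in entity.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                row[f"{key}.{sub_key}"] = sub_value
        else:
            row[key] = value
    return row


def entities_to_tsv(entities):
    """Return {entity type: TSV text}, one table per type."""
    by_type = defaultdict(list)
    for entity in entities:
        by_type[entity["type"]].append(flatten(entity))
    tables = {}
    for entity_type, rows in by_type.items():
        # Preserve first-seen column order across all rows of this type.
        columns = list(dict.fromkeys(col for row in rows for col in row))
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
        tables[entity_type] = buffer.getvalue()
    return tables


entities = [
    {"type": "case", "submitter_id": "PROJECT-INTERNAL-000055",
     "projects": {"code": "INTERNAL"}},
    {"type": "case", "submitter_id": "PROJECT-INTERNAL-000056",
     "projects": {"code": "INTERNAL"}},
    {"type": "sample", "submitter_id": "Blood-00001_55",
     "sample_type": "Blood Derived Normal",
     "cases": {"submitter_id": "PROJECT-INTERNAL-000055"}},
]

tables = entities_to_tsv(entities)
print(tables["case"])
```

Each value in `tables` can then be written to its own file, such as a cases TSV and a samples TSV, and submitted by type.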
+
+### Download Previously Uploaded Metadata Files
+
+The [transaction](Data_Submission_Process.md#transactions) page lists all previous transactions in the project. The user can download metadata files uploaded to the GDC workspace in the details section of the screen by selecting one transaction and scrolling to the "DOCUMENTS" section.
+
+
+[![Transaction Original Files](images/GDC_Submission_Transactions_Original_Files_2.png)](images/GDC_Submission_Transactions_Original_Files_2.png "Click to see the full image.")
+
+### Download Previously Uploaded Data Files
+
+The only supported method to download data files previously uploaded to the GDC Submission Portal that have not been released yet is to use the API or the [Data Transfer Tool](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Getting_Started/). To retrieve data previously uploaded to the submission portal, you will need the data file's UUID. The UUIDs for submitted data files are located in the submission portal under the file's Summary section as well as in the manifest file located on the file's Summary page.
+
+[![Submission Portal Summary View](images/gdc-submission__image2_submission_UUID.png)](images/gdc-submission__image2_submission_UUID.png "Click to see the full image.")
+
+Once the UUID(s) have been retrieved, the download process is the same as it is for downloading data files at the [GDC Portal using UUIDs](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/#downloading-data-using-gdc-file-uuids).
+
+ >__Note:__ When submittable data files are uploaded through the Data Transfer Tool they are not displayed as transactions.
diff --git a/docs/Data_Submission_Portal/Users_Guide/Getting_Started.md b/docs/Data_Submission_Portal/Users_Guide/Getting_Started.md
deleted file mode 100644
index 5de77ef20..000000000
--- a/docs/Data_Submission_Portal/Users_Guide/Getting_Started.md
+++ /dev/null
@@ -1,66 +0,0 @@
-# Getting Started
-
-## Overview
-
-The National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Submission Portal User's Guide is the companion documentation for the [GDC Data Submission Portal](https://gdc.cancer.gov/submit-data/gdc-data-submission-portal) and provides detailed information and instructions for its use.
-
-The GDC Data Submission Portal is a platform that allows researchers to submit and release data to the GDC. The key features of the GDC Data Submission Portal are:
-
-* __Upload and Validate Data__: Project data can be uploaded to the GDC project workspace. The GDC will validate the data against the [GDC Data Dictionary](https://gdc-docs.nci.nih.gov/Data_Dictionary/).
-* __Review and Submit Data__: Prior to submission, data can be reviewed to check for accuracy. Once the review is complete, the data can be submitted to the GDC for processing through [Data Harmonization](https://gdc.cancer.gov/submit-data/gdc-data-harmonization).
-* __Release Data__: After harmonization, data can be released to the research community for access through [GDC Data Access Tools](https://gdc.cancer.gov/access-data/data-access-processes-and-tools).
-* __Download Data__: Data that has been uploaded into the project workspace can be downloaded for review or update. Data can then be re-uploaded before it is released for access through [GDC Data Access Tools](https://gdc.cancer.gov/access-data/data-access-processes-and-tools).
-* __Browse Data__: Data that has been uploaded to the project workspace can be browsed to ensure that the project is ready for processing.
-* __Status and Alerts__: Visual cues are implemented to easily identify incomplete submissions.
-
-
-## Key Features
-
-### Upload and Validate Data
-To submit data to the GDC, the user will prepare the data and upload it to the project workspace.
-
-The main categories of data that can be uploaded include:
-
-* __Clinical Data__: Elements such as `gender`, `age`, `diagnosis`, etc. as defined in the GDC Data Dictionary.
-* __Biospecimen Data__: Information about entities such as `samples`, `aliquots`, etc. as defined in the GDC Data Dictionary.
-* __Submittable Data Files__: Sequencing data such as BAM and FASTQ files, slide images, and other experimental data collected by the study.
-
-The [GDC Data Dictionary Viewer](../../Data_Dictionary/viewer.md) outlines the minimum field requirements for each of the three categories listed above.
-
-### Review and Submit Data
-
-Once data is uploaded to the project workspace, it can be reviewed to ensure that the data is ready for processing through the [GDC Harmonization Process](https://gdc.cancer.gov/submit-data/gdc-data-harmonization). The review will lock the project to ensure that additional data cannot be uploaded while in review. During this period the data can be browsed or downloaded in the Data Submission Portal.
-
-If the project is ready for processing, data can be submitted to the GDC. If the project is not ready for processing, the project can be re-opened. This will allow for additional data to be uploaded to the project workspace.
-
-### Release Data
-
-The GDC will release data according to [GDC data sharing policies](https://gdc.cancer.gov/submit-data/data-submission-policies). Data may be released after six months from the date of upload, or the submitter may request earlier release using the "Request Release" function.
-
-Upon release, harmonized data will be available to GDC users through the [GDC Data Portal](https://portal.gdc.cancer.gov/) and other [GDC Data Access Tools](https://gdc.cancer.gov/access-data/data-access-processes-and-tools).
-
-
-### Redaction
-
-Data uploaded to the GDC can be updated before it is submitted for processing and harmonization. After harmonized data is released, it can only be redacted by GDC administrators under certain conditions. To request redaction of released data, please contact [GDC User Services](https://gdc.cancer.gov/support#gdc-help-desk).
-
-### Browse and Download Data
-
-Authorized submitters can browse and retrieve data submitted to their project using the Data Submission Portal. Retrieval of data submitted to the submission portal can be accomplished by using the API or the Data Transfer Tool. UUIDs of submitted files can be retrieved from the submission portal or with a [GraphQL](https://docs.gdc.cancer.gov/API/Users_Guide/Submission/#querying-submitted-data-using-graphql) query. Please see the [API](https://docs.gdc.cancer.gov/API/Users_Guide/Downloading_Files/) documentation for more information about downloads.
-
-
-### Status and Alerts
-
-The GDC Data Submission Portal Dashboard and navigation panel displays a summary of submitted data and associated data elements, such as the number of cases with Clinical data or Biospecimen data.
-
-### Transactions
-
-Submitters can access a list of all actions performed in a project by clicking on the Transactions tab on the dashboard. This will display a list of all past transactions for the selected project. Users can access details about each transaction. The most recent transactions are also displayed on the dashboard.
-
-### Submission Project Examples
-
-Step-by-step instructions on GDC data submission and their relationship to the GDC Data Model are detailed in the [Upload Data](Data_Upload_UG.md) guide.
-
-## Release Notes
-
-The [Release Notes](../../Data_Submission_Portal/Release_Notes/Data_Submission_Portal_Release_Notes.md) section of this User's Guide contains details about new features, bug fixes, and known issues.
diff --git a/docs/Data_Submission_Portal/Users_Guide/Pre_Release_QC.md b/docs/Data_Submission_Portal/Users_Guide/Pre_Release_QC.md
index 291cc668a..0996e4c69 100644
--- a/docs/Data_Submission_Portal/Users_Guide/Pre_Release_QC.md
+++ b/docs/Data_Submission_Portal/Users_Guide/Pre_Release_QC.md
@@ -1,90 +1,44 @@
# Pre-Release Data Portal
-
-## Getting Started
-
-
-### The GDC Pre-Release Data Portal: An Overview
-
-The Genomic Data Commons (GDC) Portal provides users with web-based access to pre-released data from cancer genomics studies that have been harmonized by the GDC, but not yet released in the main GDC Data Portal. Key GDC Pre-Release Data Portal features include:
-
-* Access to data prior to release on the GDC Data Portal.
-* Repository page for browsing data by project / file / case
-* File / case faceted searches to filter data
-* Cart for collecting data files of interest
-* Authentication using eRA Commons credentials for access to controlled-access data files
-* Secure data download directly from the cart or using the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool)
-* Use of API for query and download
-
-
-
+The [GDC Pre-Release Data Portal](https://portal.awg.gdc.cancer.gov/) provides users with web-based access to pre-released data from cancer genomics studies that have been harmonized by the GDC, but not yet released in the main GDC Data Portal.
## Navigation
+The [Pre-Release Data Portal](https://portal.awg.gdc.cancer.gov/) looks similar to the GDC Active Portal, but its features are a subset of what can be found in the GDC Data Portal.
-Pre-Release Data Portal features are a subset of what can be found in the GDC Data Portal. For more information on any of these general features please review the [Data Portal User Guide](/Data_Portal/Users_Guide/Getting_Started/#navigation).
-
-[![GDC Views](images/AWG_Portal.png)](images/WG_Portal.png "Click to see the full image.")
-
-
+[![GDC Views](images/AWG_Portal.png)](images/AWG_Portal.png "Click to see the full image.")
+For more information on any of these general features, please review the [GDC Data Portal User Guide](/Data_Portal/Users_Guide/Getting_Started/#navigation).
## Authentication
-### Overview
-
-The GDC Pre-Release Data Portal provides access to datasets prior to release to a group of users specified by the data submitter. This area is only available to data submitters (or their designees) for reviewing pre-release data. Users must be granted access as specified in the admin portal section and also have downloader access within dbGaP for the specified project.
-
-### GDC Authentication Tokens
-
-The GDC Pre-Release Data Portal provides authentication tokens for use with the GDC Data Transfer Tool or the GDC API. To download a token:
-
-1. Log into the GDC using your eRA Commons credentials
-2. Click the username in the top right corner of the screen
-3. Select the "Download token" option
-
-![Token Download Button](images/gdc-data-portal-token-download.png)
-
-A new token is generated each time the `Download Token` button is clicked.
-
-For more information about authentication tokens, see [Data Security](../../Data/Data_Security/Data_Security.md#authentication-tokens).
-
-**NOTE:** The authentication token should be kept in a secure location, as it allows access to all data accessible by the associated user account.
-
### Relationship between GDC Data Portal and Pre-Release Data Portal Tokens
-The tokens used to download files from the GDC Data Portal and Pre-Release Data Portal are related but distinct. Specifically, the token generated in the Pre-Release data portal contains a longer version of the regular GDC Authentication Token downloaded from the GDC Data Portal. Because of this, the GDC Data Portal token will not function for downloading data from the Pre-release Data Portal environment using the Data Transfer Tool or API. However, the Pre-Release Data Portal token will function for downloading data from the GDC Data Portal using the API or Data Transfer Tool. Finally, if a new token is generated in the Pre-release Data Portal this will invalidate the token downloaded from the GDC Data Portal and vice versa.
-
-### Logging Out
-
-To log out of the GDC, click the username in the top right corner of the screen, and select the Logout option.
-
-![Logout link](images/gdc-data-portal-token-download.png)
+The GDC Pre-Release Data Portal provides access to datasets prior to release to a group of users specified by the data submitter. This area is only available to data submitters (or their designees) for reviewing pre-release data. Users must be granted access as specified in the GDC Pre-Release Data Admin Portal section and have downloader access within dbGaP for the specified project. To learn more about obtaining the required credentials and authorization, see [Obtaining Access to Submit Data](https://gdc.cancer.gov/submit-data/obtaining-access-submit-data).
+The tokens used to download files from the GDC Data Portal and Pre-Release Data Portal are related but distinct. Specifically, the token generated in the Pre-Release Data Portal contains a longer version of the regular GDC Authentication Token downloaded from the GDC Data Portal. Because of this, the GDC Data Portal token will not function for downloading data from the Pre-Release Data Portal environment using the Data Transfer Tool or API. However, the Pre-Release Data Portal token will function for downloading data from the GDC Data Portal using the API or Data Transfer Tool. Finally, generating a new token in the Pre-Release Data Portal invalidates the token downloaded from the GDC Data Portal, and vice versa.
## Data Transfer Tool
-As with the GDC Data Portal, downloads of large or numerous files is best performed using the GDC Data Transfer Tool. Information on the GDC Data Transfer Tool is available in the [GDC Data Transfer Tool User's Guide](/node/8196/). An important distinction for use with the Pre-Release Data Portal is that it must always be used with a token and with the option `-s https://api.awg.gdc.cancer.gov`.
+As with the GDC Data Portal, downloads of large or numerous files are best performed using the GDC Data Transfer Tool. Information on the GDC Data Transfer Tool is available in the [GDC Data Transfer Tool User's Guide](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Getting_Started/). An important distinction for use with the Pre-Release Data Portal is that the tool must always be used with a token and with the option `-s https://api.awg.gdc.cancer.gov`.
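
As a minimal sketch of the token and server requirement described above, the script below assembles a download command and prints it for review rather than running it. The file names `awg-token.txt` and `gdc_manifest.txt` and the helper `awg_download_cmd` are hypothetical placeholders, not names from the tool; the `-t`, `-s`, and `-m` flags are the token-file, server, and manifest options listed in the tool's download help menu.

```shell
#!/bin/sh
# Sketch only: the file names below are hypothetical placeholders.
AWG_SERVER="https://api.awg.gdc.cancer.gov"  # server value quoted in the text above
TOKEN_FILE="awg-token.txt"                   # hypothetical Pre-Release Data Portal token file
MANIFEST="gdc_manifest.txt"                  # hypothetical download manifest

# Assemble the download command with the token file (-t), the
# Pre-Release server (-s), and the manifest (-m), then print it.
awg_download_cmd() {
  printf 'gdc-client download -t %s -s %s -m %s\n' "$TOKEN_FILE" "$AWG_SERVER" "$MANIFEST"
}

awg_download_cmd
```

Printing the command first makes it easy to confirm that both the token and the `-s` option are present before any transfer starts.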
## GDC Pre-Release Data Admin Portal
-### Overview
-
-The GDC Pre-Release Data Admin Portal allows Pre-Release Data Portal admins to create and maintain Pre-Release Data Groups and associated projects, as well as grant appropriate access to users within these groups. To gain access to the Pre-Release Data Admin Portal please contact the GDC Helpdesk (support@nci-gdc.datacommons.io).
+The GDC Pre-Release Data Admin Portal allows admins to create and maintain Pre-Release Data Groups and associated projects, as well as grant appropriate access to users within these groups. To gain access to the Pre-Release Data Admin Portal please contact the GDC Helpdesk (support@nci-gdc.datacommons.io).
[![GDC Pre-Release Data Portal Main Page](images/AWG_Admin.png)](images/AWG_Admin.png "Click to see the full image.")
The Pre-Release Data Admin Portal is broken into two views on the left-most panel:
-* __Users__: Allows admin to create, view, edit Pre-Release Data Portal user profiles
-* __Groups__: Allows admin to manage groups projects / users
+* __Users__: Allows admins to create, view, and edit Pre-Release Data Portal user profiles.
+* __Groups__: Allows admins to manage groups and their projects / users.
#### Definitions
| Entity | Definition |
|---|---|
| __User__ | An individual with an eRA Commons account. |
-| __Project__ | A collection of files and observations that are contained in the GDC database and have been registered in dbGAP as a project. Only certain projects are designated as Pre-Release Data projects.|
+| __Project__ | A collection of files and observations that are contained in the GDC database and have been registered in dbGaP as a project. Only certain projects are designated as Pre-Release Data projects.|
| __Group__ | A collection of users and projects. When a user is assigned to a group, they will have access to the projects in that group when they login to the Pre-Release Data portal as long as they have downloader access to the project in dbGaP.|
### Users
@@ -95,15 +49,15 @@ The __Users__ section of the GDC Pre-Release Data Admin portal allows admins to
#### Creating Users
-To create a new user in the Pre-Release Data Admin Portal, click on the `Create` button on the far right panel.
+To create a new user in the Pre-Release Data Admin Portal, click on the `Create` button on the far-right panel.
[![GDC Pre-Release Data Portal Main Page](images/AWG_Admin_Create_User.png)](images/AWG_Admin_Create_User.png "Click to see the full image.")
Then the following information must be supplied, before clicking the `Save` button:
-* __eRA Commons ID__: The eRA Commons ID of the user to be added
-* __Role__: Choose between `Admin` or `User` roles
-* __Group (Optional)__: Choose existing groups to add the user to
+* __eRA Commons ID__: The eRA Commons ID of the user to be added.
+* __Role__: Choose between `Admin` or `User` roles.
+* __Group (Optional)__: Choose existing groups to add the user to.
After clicking `Save`, the user should appear in the list of users in the center panel. Also clicking on the user in the list will display information about that user and gives the options to `Edit` the user profile, or `Delete` the user.
@@ -117,24 +71,24 @@ The __Groups__ section of the GDC Pre-Release Data Admin portal allows admins to
#### Creating Groups
-To create a new group in the Pre-Release Data Admin Portal, click on the `Create` button on the far right panel.
+To create a new group in the Pre-Release Data Admin Portal, click on the `Create` button on the far-right panel.
[![GDC Pre-Release Data Portal Main Page](images/AWG_Admin_Groups_Add.png)](images/AWG_Admin_Groups_Add.png "Click to see the full image.")
Then the following information must be supplied, before clicking the `Save` button:
-* __Name__: The name of the group
-* __Description__: The description of the group
-* __Users (Optional)__: Choose existing users to add to the group
-* __Projects(Optional)__: Choose existing projects to add to the group
+* __Name__: The name of the group.
+* __Description__: The description of the group.
+* __Users (Optional)__: Choose existing users to add to the group.
+* __Projects (Optional)__: Choose existing projects to add to the group.
After clicking `Save`, the group should appear in the list of groups in the center panel. Also clicking on the group in the list will display information about that group and gives the options to `Edit` or `Delete` the group.
[![GDC Pre-Release Data Portal Main Page](images/AWG_Admin_New_Group.png)](images/AWG_Admin_New_Group.png "Click to see the full image.")
-## API
+## AWG API
-API functionality is similar to what is available for the main GDC Data Portal. You can read more about the GDC API in general in the [API User Guide](/API/Users_Guide/Getting_Started/). Important differences for the AWG API include the following:
+API functionality is similar to what is available for the main GDC Data Portal. You can read more about the GDC API in general in the [API User Guide](/API/Users_Guide/Getting_Started/). Important differences for the Analysis Working Group (AWG) API include the following:
* The base URL is different. Instead use https://api.awg.gdc.cancer.gov/
* An authorization token must always be passed with every query rather than just for downloading controlled access data.
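
The differences above can be sketched as a small helper that builds a `curl` command against the AWG base URL with a token attached to the query. This is an illustration under stated assumptions: the token file `awg-token.txt` and the helper `awg_query_cmd` are hypothetical, and the `X-Auth-Token` header and the `files` endpoint are the ones used with the main GDC API, assumed here to behave the same way for the AWG API. The script prints the command instead of sending a request.

```shell
#!/bin/sh
# Sketch only: token path, header name, and endpoint are assumptions.
AWG_API="https://api.awg.gdc.cancer.gov"  # base URL from the list above
TOKEN_FILE="awg-token.txt"                # hypothetical token file

# Build a curl command that passes the token with the query.
# The $(cat ...) substitution is left unexpanded so the token
# itself is never printed.
awg_query_cmd() {
  endpoint="$1"
  printf 'curl --header "X-Auth-Token: $(cat %s)" "%s/%s"\n' "$TOKEN_FILE" "$AWG_API" "$endpoint"
}

awg_query_cmd "files?size=1"
```

Because every AWG query needs the token, wrapping the header in one helper avoids forgetting it on metadata queries that would not require a token on the main API.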
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC-HomePage-Submit_v2.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC-HomePage-Submit_v2.png
index 56849cc7e..79810d43c 100644
Binary files a/docs/Data_Submission_Portal/Users_Guide/images/GDC-HomePage-Submit_v2.png and b/docs/Data_Submission_Portal/Users_Guide/images/GDC-HomePage-Submit_v2.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Data_Submission_Workflow-updated_20190301.jpg b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Data_Submission_Workflow-updated_20190301.jpg
new file mode 100644
index 000000000..51fce4c7c
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Data_Submission_Workflow-updated_20190301.jpg differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_SUBMIT_TO_GDC_v3.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_SUBMIT_TO_GDC_v3.png
new file mode 100644
index 000000000..3cd73a01f
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_SUBMIT_TO_GDC_v3.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Dashboard_2.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Dashboard_2.png
index 29f8cebd9..dc6357a73 100644
Binary files a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Dashboard_2.png and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Dashboard_2.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Dashboard_4.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Dashboard_4.png
new file mode 100644
index 000000000..69ffdbe76
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Dashboard_4.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Project_State.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Project_State.png
index 18683c176..8375522e7 100644
Binary files a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Project_State.png and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Project_State.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Project_State_Review_3.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Project_State_Review_3.png
new file mode 100644
index 000000000..ec84b299a
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Project_State_Review_3.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Project_State_v3.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Project_State_v3.png
new file mode 100644
index 000000000..78e6420e8
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Project_State_v3.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Submit_tab_2_v4.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Submit_tab_2_v4.png
new file mode 100644
index 000000000..e4423432c
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Submit_Release_Submit_tab_2_v4.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_2.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_2.png
index 5f30a9bb6..7b2c85630 100644
Binary files a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_2.png and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_2.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_Details_2.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_Details_2.png
index dc75edf28..2e007b0ba 100644
Binary files a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_Details_2.png and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_Details_2.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_Details_3.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_Details_3.png
new file mode 100644
index 000000000..9f65626fb
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_Transactions_Details_3.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_UUID_location.png b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_UUID_location.png
new file mode 100644
index 000000000..a12708b36
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/GDC_Submission_UUID_location.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/Submission.png b/docs/Data_Submission_Portal/Users_Guide/images/Submission.png
new file mode 100644
index 000000000..eee369813
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/Submission.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/Untitled.png b/docs/Data_Submission_Portal/Users_Guide/images/Untitled.png
deleted file mode 100644
index 7dbe428eb..000000000
Binary files a/docs/Data_Submission_Portal/Users_Guide/images/Untitled.png and /dev/null differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-data-upload-workflow.png b/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-data-upload-workflow.png
index 4dd08fd1f..81f788499 100644
Binary files a/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-data-upload-workflow.png and b/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-data-upload-workflow.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-data-upload-workflow_2.jpg b/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-data-upload-workflow_2.jpg
new file mode 100644
index 000000000..fd641d17d
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-data-upload-workflow_2.jpg differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-data-upload-workflow_2.png b/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-data-upload-workflow_2.png
new file mode 100644
index 000000000..cec6b4b9f
Binary files /dev/null and b/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-data-upload-workflow_2.png differ
diff --git a/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-file-state-vs-state.png b/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-file-state-vs-state.png
index 75c10971a..a7a08c50f 100644
Binary files a/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-file-state-vs-state.png and b/docs/Data_Submission_Portal/Users_Guide/images/gdc-submission-portal-file-state-vs-state.png differ
diff --git a/docs/Data_Transfer_Tool/Release_Notes/DTT_Release_Notes.md b/docs/Data_Transfer_Tool/Release_Notes/DTT_Release_Notes.md
index c5efa4685..cf9117250 100644
--- a/docs/Data_Transfer_Tool/Release_Notes/DTT_Release_Notes.md
+++ b/docs/Data_Transfer_Tool/Release_Notes/DTT_Release_Notes.md
@@ -2,12 +2,36 @@
| Version | Date |
|---|---|
+| [v1.4.0](DTT_Release_Notes.md#v140) | December 18, 2018 |
| [v1.3.0](DTT_Release_Notes.md#v130) | August 22, 2017 |
| [v1.2.0](DTT_Release_Notes.md#v120) | Oct 31, 2016 |
| [v1.1.0](DTT_Release_Notes.md#v110) | September 7, 2016 |
| [v1.0.1](DTT_Release_Notes.md#v101) | June 2, 2016 |
| [v1.0.0](DTT_Release_Notes.md#v100) | May 26, 2016 |
+## v1.4.0
+* __GDC Product__: Data Transfer Tool
+* __Release Date__: December 18, 2018
+
+### New Features and Changes
+* Enabled download of the latest file version
+* Removed interactive mode
+* Enabled display of all default settings
+* Standardized upload and download help menus
+
+### Bugs Fixed Since Last Release
+* Fixed bug in download flag `--no-related-files` that prevented file downloads
+* Fixed bug in handling of file names containing forward slashes
+* Fixed bug in download flag `--no-segment-md5sums`
+
+### Known Issues and Workarounds
+* Use of non-ASCII characters in a token passed to the Data Transfer Tool will produce an incorrect error message: "Internal server error: Auth service temporarily unavailable".
+* On some terminals, dragging and dropping a file into the interactive client will add single quotes (' ') around the file path. This causes the interactive client to misinterpret the file path and generate an error when attempting to load a manifest file or token.
+ * *Workaround:* Manually type out the file name or remove the single quotes from around the file path.
+* When any files mentioned in the upload manifest are not present in the upload directory, the submission will hang at the missing file.
+    * *Workaround:* Edit the manifest to specify only the files that are present in the upload directory for submission, or copy the missing files into the upload directory.
+
+
## v1.3.0
* __GDC Product__: Data Transfer Tool
* __Release Date__: August 22, 2017
@@ -103,12 +127,6 @@
* On some terminals, dragging and dropping a file into the interactive client will add single quotes (' ') around the file path. This causes the interactive client to misinterpret the file path and generate an error when attempting to load a manifest file or token.
* *Workaround:* Manually type out the file name or remove the single quotes from around the file path.
-
-
-
-
-
-
## v1.0.0
* __GDC Product__: Data Transfer Tool
diff --git a/docs/Data_Transfer_Tool/Users_Guide/Accessing_Built-in_Help.md b/docs/Data_Transfer_Tool/Users_Guide/Accessing_Built-in_Help.md
index 7eaf8d739..42df8861f 100644
--- a/docs/Data_Transfer_Tool/Users_Guide/Accessing_Built-in_Help.md
+++ b/docs/Data_Transfer_Tool/Users_Guide/Accessing_Built-in_Help.md
@@ -10,7 +10,7 @@ The GDC Data Transfer Tool comes with built-in help menus. These menus are displ
gdc-client --help
```
``` Output
-usage: gdc-client [-h] [--version] {download,upload,interactive} ...
+usage: gdc-client [-h] [--version] {download,upload,settings} ...
The Genomic Data Commons Command Line Client
@@ -19,11 +19,11 @@ optional arguments:
--version show program's version number and exit
commands:
- {download,upload,interactive}
+ {download,upload,settings}
for more information, specify -h after a command
download download data from the GDC
upload upload data to the GDC
- interactive run in interactive mode
+ settings display default settings
```
The available menus are provided below.
@@ -36,7 +36,7 @@ The GDC Data Transfer Tool displays the following output when executed without a
gdc-client
```
```Output
-usage: gdc-client [-h] [--version] {download,upload,interactive} ...
+usage: gdc-client [-h] [--version] {download,upload,settings} ...
gdc-client: error: too few arguments
```
@@ -49,53 +49,64 @@ The GDC Data Transfer Tool displays the following help menu for its download fun
gdc-client download --help
```
```Output
-usage: gdc-client download [-h] [--debug] [--log-file LOG_FILE]
- [-t TOKEN_FILE] [-d DIR] [-s server]
- [--no-segment-md5sums] [--no-file-md5sum]
- [-n N_PROCESSES]
- [--http-chunk-size HTTP_CHUNK_SIZE]
- [--save-interval SAVE_INTERVAL]
- [--no-verify] [--no-related-files]
- [--no-annotations] [--no-auto-retry]
- [--retry-amount RETRY_AMOUNT]
- [--wait-time WAIT_TIME] [-u] [-m MANIFEST]
- [file_id [file_id ...]]
+usage: gdc-client download [-h] [--debug]
+ [--log-file LOG_FILE]
+ [--color_off] [-t TOKEN_FILE]
+ [-d DIR] [-s server]
+ [--no-segment-md5sums]
+ [--no-file-md5sum]
+ [-n N_PROCESSES]
+ [--http-chunk-size HTTP_CHUNK_SIZE]
+ [--save-interval SAVE_INTERVAL]
+ [--no-verify]
+ [--no-related-files]
+ [--no-annotations]
+ [--no-auto-retry]
+ [--retry-amount RETRY_AMOUNT]
+ [--wait-time WAIT_TIME]
+ [--latest] [--config FILE] [-u]
+ [-m MANIFEST]
+ [file_id [file_id ...]]
positional arguments:
- file_id The GDC UUID of the file(s) to download
+file_id The GDC UUID of the file(s) to download
optional arguments:
- -h, --help show this help message and exit
- --debug Enable debug logging. If a failure occurs, the program
- will stop.
- --log-file LOG_FILE Save logs to file. Amount logged affected by --debug
- -t TOKEN_FILE, --token-file TOKEN_FILE
- GDC API auth token file
- -d DIR, --dir DIR Directory to download files to. Defaults to current dir
- -s server, --server server
+-h, --help show this help message and exit
+--debug Enable debug logging. If a failure occurs, the program
+ will stop.
+--log-file LOG_FILE Save logs to file. Amount logged affected by --debug
+--color_off Disable colored output
+-t TOKEN_FILE, --token-file TOKEN_FILE
+ GDC API auth token file
+-d DIR, --dir DIR Directory to download files to. Defaults to current
+ dir
+-s server, --server server
The TCP server address server[:port]
- --no-segment-md5sums Do not calculate inbound segment md5sumsand/or do not
+--no-segment-md5sums Do not calculate inbound segment md5sums and/or do not
verify md5sums on restart
- --no-file-md5sum Do not verify file md5sum after download
- -n N_PROCESSES, --n-processes N_PROCESSES
- Number of client connections.
- --http-chunk-size HTTP_CHUNK_SIZE
- Size in bytes of standard HTTP block size.
- --save-interval SAVE_INTERVAL
- The number of chunks after which to flush state file.
- A lower save interval will result in more frequent
- printout but lower performance.
- --no-verify Perform insecure SSL connection and transfer
- --no-related-files Do not download related files.
- --no-annotations Do not download annotations.
- --no-auto-retry Ask before retrying to download a file
- --retry-amount RETRY_AMOUNT
- Number of times to retry a download
- --wait-time WAIT_TIME
- Amount of seconds to wait before retrying
- -u, --udt Use the UDT protocol.
- -m MANIFEST, --manifest MANIFEST
- GDC download manifest file
+--no-file-md5sum Do not verify file md5sum after download
+-n N_PROCESSES, --n-processes N_PROCESSES
+ Number of client connections.
+--http-chunk-size HTTP_CHUNK_SIZE, -c HTTP_CHUNK_SIZE
+ Size in bytes of standard HTTP block size.
+--save-interval SAVE_INTERVAL
+ The number of chunks after which to flush state file.
+ A lower save interval will result in more frequent
+ printout but lower performance.
+--no-verify Perform insecure SSL connection and transfer
+--no-related-files Do not download related files.
+--no-annotations Do not download annotations.
+--no-auto-retry Ask before retrying to download a file
+--retry-amount RETRY_AMOUNT
+ Number of times to retry a download
+--wait-time WAIT_TIME
+ Amount of seconds to wait before retrying
+--latest Download latest version of a file if it exists
+--config FILE Path to INI-type config file
+-u, --udt Use the UDT protocol.
+-m MANIFEST, --manifest MANIFEST
+ GDC download manifest file
```
### Upload help menu
@@ -107,38 +118,51 @@ The GDC Data Transfer Tool displays the following help menu for its upload funct
gdc-client upload --help
```
```Output
-usage: gdc-client upload [-h] [--debug] [-v] [--log-file LOG_FILE]
- [-T TOKEN | -t TOKEN] [-H HOST] [-P PORT]
- [--project-id PROJECT_ID] [--identifier IDENTIFIER]
- [--path path] [--upload-id UPLOAD_ID] [--insecure]
- [--server SERVER] [--part-size PART_SIZE]
- [-n N_PROCESSES] [--disable-multipart] [--abort]
- [--resume] [--delete] [--manifest MANIFEST]
+usage: gdc-client upload [-h] [--debug]
+ [--log-file LOG_FILE]
+ [--color_off] [-t TOKEN_FILE]
+ [--project-id PROJECT_ID]
+ [--path path]
+ [--upload-id UPLOAD_ID]
+ [--insecure] [--server SERVER]
+ [--part-size PART_SIZE]
+ [--upload-part-size UPLOAD_PART_SIZE]
+ [-n N_PROCESSES]
+ [--disable-multipart] [--abort]
+ [--resume] [--delete]
+ [--manifest MANIFEST]
+ [--config FILE]
+ [file_id [file_id ...]]
+positional arguments:
+ file_id The GDC UUID of the file(s) to upload
optional arguments:
- -h, --help show this help message and exit
- --debug Enable debug logging. If a failure occurs, the program
- will stop.
- --log-file LOG_FILE Save logs to file. Amount logged affected by --debug
- -t TOKEN_FILE, --token-file TOKEN_FILE
- GDC API auth token file
- --project-id PROJECT_ID, -p PROJECT_ID
- The project ID that owns the file
- --path path, -f path directory path to find file
- --upload-id UPLOAD_ID, -u UPLOAD_ID
- Multipart upload id
- --insecure, -k Allow connections to server without certs
- --server SERVER, -s SERVER
- GDC API server address
- --part-size PART_SIZE, -ps PART_SIZE
- Part size for multipart upload
- -n N_PROCESSES, --n-processes N_PROCESSES
- Number of client connections
- --disable-multipart Disable multipart upload
- --abort Abort previous multipart upload
- --resume, -r Resume previous multipart upload
- --delete Delete an uploaded file
- --manifest MANIFEST, -m MANIFEST
- Manifest which describes files to be uploaded
---identifier, -i DEPRECATED
+ -h, --help show this help message and exit
+ --debug Enable debug logging. If a failure occurs, the program
+ will stop.
+ --log-file LOG_FILE Save logs to file. Amount logged affected by --debug
+ --color_off Disable colored output
+ -t TOKEN_FILE, --token-file TOKEN_FILE
+ GDC API auth token file
+ --project-id PROJECT_ID, -p PROJECT_ID
+ The project ID that owns the file
+ --path path, -f path directory path to find file
+ --upload-id UPLOAD_ID, -u UPLOAD_ID
+ Multipart upload id
+ --insecure, -k Allow connections to server without certs
+ --server SERVER, -s SERVER
+ GDC API server address
+ --part-size PART_SIZE
+ DEPRECATED in favor of [--upload-part-size]
+ --upload-part-size UPLOAD_PART_SIZE, -c UPLOAD_PART_SIZE
+ Part size for multipart upload
+ -n N_PROCESSES, --n-processes N_PROCESSES
+ Number of client connections
+ --disable-multipart Disable multipart upload
+ --abort Abort previous multipart upload
+ --resume, -r Resume previous multipart upload
+ --delete Delete an uploaded file
+ --manifest MANIFEST, -m MANIFEST
+ Manifest which describes files to be uploaded
+ --config FILE Path to INI-type config file
```
diff --git a/docs/Data_Transfer_Tool/Users_Guide/Appendix_A_-_Config_File.md b/docs/Data_Transfer_Tool/Users_Guide/Appendix_A_-_Config_File.md
new file mode 100644
index 000000000..48b7b1d2e
--- /dev/null
+++ b/docs/Data_Transfer_Tool/Users_Guide/Appendix_A_-_Config_File.md
@@ -0,0 +1,40 @@
+###Data Transfer Tool Configuration File
+The Data Transfer Tool (DTT) can save and reuse configuration parameters from a flat text file, which is passed to the tool with the --config command line argument. First create a plain text file with an extension of either txt or dtt. The supported section headers are upload and download; they can be used independently of each other or together in the same configuration file. Each section header corresponds to one of the application's main functions: downloading data from the GDC portals or uploading data to the GDC submission system. The configurable parameters are those listed in the help menus under either [download](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Accessing_Built-in_Help/#download-help-menu) or [upload](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Accessing_Built-in_Help/#upload-help-menu).
+
+
+Example usage:
+
+ gdc-client download d45ec02b-13c3-4afa-822d-443ccd3795ca --config my-dtt-config.dtt
+
+Example of configuration file:
+
+ [upload]
+ path = /some/upload/path
+ http_chunk_size = 1024
+
+
+ [download]
+ dir = /some/download/path
+ http_chunk_size = 2048
+ retry_amount = 6
+
+
+###Display Config Parameters
+The settings command can be used with either the download or upload function to display which settings are active in a custom Data Transfer Tool configuration file.
+
+ gdc-client settings download --config my-dtt-config.dtt
+ [download]
+ no_auto_retry = False
+ no_file_md5sum = False
+ save_interval = 1073741824
+ http_chunk_size = 2048
+ server = http://example-site.com
+ n_processes = 8
+ no_annotations = False
+ no_related_files = False
+ retry_amount = 6
+ no_segment_md5sum = False
+ manifest = []
+ wait_time = 5.0
+ no_verify = True
+ dir = /some/download/path
diff --git a/docs/Data_Transfer_Tool/Users_Guide/Appendix_A_-_Key_Terms.md b/docs/Data_Transfer_Tool/Users_Guide/Appendix_A_-_Key_Terms.md
deleted file mode 100644
index 61a7e1e16..000000000
--- a/docs/Data_Transfer_Tool/Users_Guide/Appendix_A_-_Key_Terms.md
+++ /dev/null
@@ -1,13 +0,0 @@
-The following table provides definitions and explanations for terms and acronyms relevant to the content presented within this document.
-
-| Term | Definition |
-|-------|--------------------------------------------------|
-| eRA | Electronic Research Administration |
-| GDC | Genomic Data Commons |
-| HTTP | Hypertext Transfer Protocol |
-| HTTPS | HTTP Secure |
-| ID | Identifier |
-| NCI | National Cancer Institute |
-| TCGA | The Cancer Genome Atlas |
-| TCP | Transmission Control Protocol |
-| UUID | Universally Unique Identifier |
diff --git a/docs/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload.md b/docs/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload.md
index a57d21721..4ea3daac1 100644
--- a/docs/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload.md
+++ b/docs/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload.md
@@ -1,4 +1,6 @@
-#Data Downloads and Uploads from the command line.
+#Data Transfer Tool Command Line Documentation
+
+
## Downloads
@@ -26,6 +28,19 @@ The GDC Data Transfer Tool supports resumption of interrupted downloads. To resu
gdc-client download f80ec672-d00f-42d5-b5ae-c7e06bc39da1
+### Download Latest Version of a File
+The GDC Data Transfer Tool supports file versioning. The GDC backend data storage keeps multiple file versions so that both older and current versions remain accessible to users. For information about accessing file versioning information with the API and finding older UUIDs from current UUIDs, see the [API User Guide](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#example-of-retrieving-file-version-information) section in the API documentation. When working with older manifests or older lists of UUIDs, the latest version of a file can always be downloaded with the --latest flag.
+
+```Shell
+gdc-client download 426de656-7e34-4a49-b87e-6e2563fa3cdd --latest -t gdc-user-token.2018.txt
+```
+```Output
+Downloading LATEST versions of files
+Latest version for 426de656-7e34-4a49-b87e-6e2563fa3cdd ==> 6633bfbd-87f1-4d3a-a475-7ad1e8c2017a
+100% [#############################################################################################################################] Time: 0:01:16 14.10 MB/s
+Successfully downloaded: 1
+```
+
### Downloading Controlled-Access Data
A user authentication token is required for downloading Controlled-Access Data from GDC. Tokens can be obtained from the GDC Data Portal (see instructions in [Obtaining an Authentication Token](Preparing_for_Data_Download_and_Upload.md#obtaining-an-authentication-token)). Once downloaded, the token *file* can be passed to the GDC Data Transfer Tool using the **-t** or **--token-file** option:
@@ -137,3 +152,211 @@ To resolve this issue, delete the file using the **--delete** switch before re-u
Attempting to run gdc-client.exe by double-clicking it in the Windows Explorer will produce a window that blinks once and disappears.
This is normal, the executable must be run using the command prompt. Click 'Start', followed by 'Run' and type 'cmd' into the text bar. Then navigate to the path containing the executable using the 'cd' command.
+
+## Help Menus
+
+The GDC Data Transfer Tool comes with built-in help menus. These menus are displayed when the GDC Data Transfer Tool is run with the -h or --help flag, either on its own or after any of the tool's main commands. Running the GDC Data Transfer Tool without any argument or flag presents a list of available command options.
+
+
+
+```Shell
+gdc-client --help
+```
+``` Output
+usage: gdc-client [-h] [--version] {download,upload,settings} ...
+
+The Genomic Data Commons Command Line Client
+
+optional arguments:
+ -h, --help show this help message and exit
+ --version show program's version number and exit
+
+commands:
+ {download,upload,settings}
+ for more information, specify -h after a command
+ download download data from the GDC
+ upload upload data to the GDC
+ settings display default settings
+```
+
+ The available menus are provided below.
+
+### Root menu
+
+The GDC Data Transfer Tool displays the following output when executed without any arguments.
+
+```Shell
+gdc-client
+```
+```Output
+usage: gdc-client [-h] [--version] {download,upload,settings} ...
+gdc-client: error: too few arguments
+```
+
+
+### Download help menu
+
+The GDC Data Transfer Tool displays the following help menu for its download functionality.
+
+```Shell
+gdc-client download --help
+```
+```Output
+usage: gdc-client download [-h] [--debug]
+ [--log-file LOG_FILE]
+ [--color_off] [-t TOKEN_FILE]
+ [-d DIR] [-s server]
+ [--no-segment-md5sums]
+ [--no-file-md5sum]
+ [-n N_PROCESSES]
+ [--http-chunk-size HTTP_CHUNK_SIZE]
+ [--save-interval SAVE_INTERVAL]
+ [--no-verify]
+ [--no-related-files]
+ [--no-annotations]
+ [--no-auto-retry]
+ [--retry-amount RETRY_AMOUNT]
+ [--wait-time WAIT_TIME]
+ [--latest] [--config FILE] [-u]
+ [-m MANIFEST]
+ [file_id [file_id ...]]
+
+positional arguments:
+file_id The GDC UUID of the file(s) to download
+
+optional arguments:
+-h, --help show this help message and exit
+--debug Enable debug logging. If a failure occurs, the program
+ will stop.
+--log-file LOG_FILE Save logs to file. Amount logged affected by --debug
+--color_off Disable colored output
+-t TOKEN_FILE, --token-file TOKEN_FILE
+ GDC API auth token file
+-d DIR, --dir DIR Directory to download files to. Defaults to current
+ dir
+-s server, --server server
+ The TCP server address server[:port]
+--no-segment-md5sums Do not calculate inbound segment md5sums and/or do not
+ verify md5sums on restart
+--no-file-md5sum Do not verify file md5sum after download
+-n N_PROCESSES, --n-processes N_PROCESSES
+ Number of client connections.
+--http-chunk-size HTTP_CHUNK_SIZE, -c HTTP_CHUNK_SIZE
+ Size in bytes of standard HTTP block size.
+--save-interval SAVE_INTERVAL
+ The number of chunks after which to flush state file.
+ A lower save interval will result in more frequent
+ printout but lower performance.
+--no-verify Perform insecure SSL connection and transfer
+--no-related-files Do not download related files.
+--no-annotations Do not download annotations.
+--no-auto-retry Ask before retrying to download a file
+--retry-amount RETRY_AMOUNT
+ Number of times to retry a download
+--wait-time WAIT_TIME
+ Amount of seconds to wait before retrying
+--latest Download latest version of a file if it exists
+--config FILE Path to INI-type config file
+-u, --udt Use the UDT protocol.
+-m MANIFEST, --manifest MANIFEST
+ GDC download manifest file
+```
+
+### Upload help menu
+
+The GDC Data Transfer Tool displays the following help menu for its upload functionality.
+
+
+```Shell
+gdc-client upload --help
+```
+```Output
+usage: gdc-client upload [-h] [--debug]
+ [--log-file LOG_FILE]
+ [--color_off] [-t TOKEN_FILE]
+ [--project-id PROJECT_ID]
+ [--path path]
+ [--upload-id UPLOAD_ID]
+ [--insecure] [--server SERVER]
+ [--part-size PART_SIZE]
+ [--upload-part-size UPLOAD_PART_SIZE]
+ [-n N_PROCESSES]
+ [--disable-multipart] [--abort]
+ [--resume] [--delete]
+ [--manifest MANIFEST]
+ [--config FILE]
+ [file_id [file_id ...]]
+positional arguments:
+ file_id The GDC UUID of the file(s) to upload
+
+optional arguments:
+ -h, --help show this help message and exit
+ --debug Enable debug logging. If a failure occurs, the program
+ will stop.
+ --log-file LOG_FILE Save logs to file. Amount logged affected by --debug
+ --color_off Disable colored output
+ -t TOKEN_FILE, --token-file TOKEN_FILE
+ GDC API auth token file
+ --project-id PROJECT_ID, -p PROJECT_ID
+ The project ID that owns the file
+ --path path, -f path directory path to find file
+ --upload-id UPLOAD_ID, -u UPLOAD_ID
+ Multipart upload id
+ --insecure, -k Allow connections to server without certs
+ --server SERVER, -s SERVER
+ GDC API server address
+ --part-size PART_SIZE
+ DEPRECATED in favor of [--upload-part-size]
+ --upload-part-size UPLOAD_PART_SIZE, -c UPLOAD_PART_SIZE
+ Part size for multipart upload
+ -n N_PROCESSES, --n-processes N_PROCESSES
+ Number of client connections
+ --disable-multipart Disable multipart upload
+ --abort Abort previous multipart upload
+ --resume, -r Resume previous multipart upload
+ --delete Delete an uploaded file
+ --manifest MANIFEST, -m MANIFEST
+ Manifest which describes files to be uploaded
+ --config FILE Path to INI-type config file
+```
+
+##Data Transfer Tool Configuration File
+The Data Transfer Tool (DTT) can save and reuse configuration parameters from a flat text file, which is passed to the tool with the --config command line argument. First create a plain text file with an extension of either txt or dtt. The supported section headers are upload and download; they can be used independently of each other or together in the same configuration file. Each section header corresponds to one of the application's main functions: downloading data from the GDC portals or uploading data to the GDC submission system. The configurable parameters are those listed in the help menus under either [download](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Accessing_Built-in_Help/#download-help-menu) or [upload](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Accessing_Built-in_Help/#upload-help-menu).
+
+
+Example usage:
+
+ gdc-client download d45ec02b-13c3-4afa-822d-443ccd3795ca --config my-dtt-config.dtt
+
+Example of configuration file:
+
+ [upload]
+ path = /some/upload/path
+ http_chunk_size = 1024
+
+
+ [download]
+ dir = /some/download/path
+ http_chunk_size = 2048
+ retry_amount = 6
+
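Because the configuration file uses INI-style sections, its layout can be checked with any INI parser before it is passed to the tool. The following is a minimal sketch using Python's standard `configparser` module. It is not part of the Data Transfer Tool: the helper name `load_dtt_config` is hypothetical, and the sketch only checks the `[upload]` and `[download]` section headers described above, not the individual parameter names.

```python
import configparser

# Section headers the documentation above describes as supported.
SUPPORTED_SECTIONS = ("upload", "download")


def load_dtt_config(text):
    """Parse INI-style config text and return {section: {key: value}}.

    Hypothetical helper for illustration; all values are kept as strings.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    unknown = [s for s in parser.sections() if s not in SUPPORTED_SECTIONS]
    if unknown:
        raise ValueError("unsupported section(s): " + ", ".join(unknown))
    return {name: dict(parser[name]) for name in parser.sections()}


# Same content as the example configuration file shown above.
EXAMPLE = """
[upload]
path = /some/upload/path
http_chunk_size = 1024

[download]
dir = /some/download/path
http_chunk_size = 2048
retry_amount = 6
"""

if __name__ == "__main__":
    for section, options in load_dtt_config(EXAMPLE).items():
        print(section, options)
```

A check like this only confirms that the file is well-formed INI; the settings that the tool actually applies can be confirmed with the settings command.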
+
+###Display Config Parameters
+The settings command can be used with either the download or upload function to display which settings are active in a custom Data Transfer Tool configuration file.
+
+ gdc-client settings download --config my-dtt-config.dtt
+ [download]
+ no_auto_retry = False
+ no_file_md5sum = False
+ save_interval = 1073741824
+ http_chunk_size = 2048
+ server = http://example-site.com
+ n_processes = 8
+ no_annotations = False
+ no_related_files = False
+ retry_amount = 6
+ no_segment_md5sum = False
+ manifest = []
+ wait_time = 5.0
+ no_verify = True
+ dir = /some/download/path
diff --git a/docs/Encyclopedia/ReadyForApproval/DAVE.md b/docs/Encyclopedia/ReadyForApproval/DAVE.md
new file mode 100644
index 000000000..33bc88c26
--- /dev/null
+++ b/docs/Encyclopedia/ReadyForApproval/DAVE.md
@@ -0,0 +1,16 @@
+# DAVE #
+
+## Description ##
+The NCI Genomic Data Commons (GDC) DAVE (Data Analysis, Visualization, and Exploration) tools
+are an open-access, interactive visualization application for working with the data stored in the GDC. Analysis can be performed online in real time without downloading any of the data.
+## Overview ##
+
+
+### Tools ###
+## References ##
+1. [GDC DAVE](https://gdc.cancer.gov/dave-factsheet)
+
+## External Links ##
+* N/A
+
+Categories: Workflow Type
diff --git a/docs/Encyclopedia/ReadyForApproval/FASTQv3.md b/docs/Encyclopedia/ReadyForApproval/FASTQv3.md
new file mode 100644
index 000000000..3b64ea4cd
--- /dev/null
+++ b/docs/Encyclopedia/ReadyForApproval/FASTQv3.md
@@ -0,0 +1,19 @@
+# FASTQ #
+## Description ##
+FASTQ is a text-based file format standard used to store sequence and quality score information generated by high-throughput sequencing data acquisition systems.
+
+## Overview ##
+
+### Structure ###
+A record in a FASTQ file consists of:
+1) A line starting with "@" and containing the sequence identifier along with an optional description.
+2) Lines consisting of raw sequence information.
+3) A line starting with "+", optionally followed by the sequence identifier repeated.
+4) Lines containing the quality score information.
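
The record structure above can be illustrated with a short parser. This is a minimal sketch, assuming the common layout in which the sequence and the quality scores each occupy exactly one line (four lines per record); wrapped multi-line records are not handled, and the function name `read_fastq` is hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a FASTQ stream.

    Assumes exactly four lines per record; wrapped sequence or quality
    lines are not handled in this sketch.
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError("record does not start with '@': " + header)
        if not separator.startswith("+"):
            raise ValueError("missing '+' separator line")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Drop the leading '@' and keep the identifier without the description.
        parts = header[1:].split(None, 1)
        identifier = parts[0] if parts else ""
        yield identifier, sequence, quality


if __name__ == "__main__":
    example = io.StringIO("@read1 optional description\nGATTACA\n+\nIIIIIII\n")
    for record in read_fastq(example):
        print(record)
```

The length check reflects the fact that each base in the sequence line has exactly one quality character.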
+
+## References ##
+1. [FASTQ Format](https://en.wikipedia.org/wiki/FASTQ_format)
+
+## External Links ##
+*
+Categories: Data Format
diff --git a/docs/Encyclopedia/ReadyForApproval/Mutation_Annotation_Format_TCGAv2.md b/docs/Encyclopedia/ReadyForApproval/Mutation_Annotation_Format_TCGAv2.md
new file mode 100644
index 000000000..7e20fc7f5
--- /dev/null
+++ b/docs/Encyclopedia/ReadyForApproval/Mutation_Annotation_Format_TCGAv2.md
@@ -0,0 +1,373 @@
+Mutation Annotation Format (MAF) - Legacy TCGA Specification
+==============================================
+
+
+*This definition was taken from the previously public wiki hosted by TCGA and reflects the MAF format
+that was available during the active period of the TCGA project.*
+
+
+
+
+**Document Information**
+
+The spec has been reverted to the June 26th version (version 20). Additional
+changes are the removal of the "under construction" banner, changing all text to
+black, and fixing a typo in the link to the MAF 2.2 specification.
+
+**Specification for Mutation Annotation Format**
+Version 2.4.1
+June 20, 2014
+
+**Contents**
+
+- 1 Current version changes
+
+- 2 About MAF specifications
+
+ - 2.1 Definition of open access MAF
+ data
+
+ - 2.2 Somatic MAF vs. Protected
+ MAF
+
+- 3 MAF file fields
+
+ - 3.1 Table 1 - File column
+ headers
+
+- 4 MAF file checks
+
+- 5 MAF naming convention
+
+- 6 Previous specification
+ versions
+
+Current version changes
+=======================
+
+This current revision is **version 2.4.1** of the Mutation Annotation Format
+(MAF) specification.
+
+The following items in the specification were added or modified in version 2.4.1
+from version 2.4:
+
+- Header for MAF file is "\#version 2.4.1"
+
+- "Somatic" and "None" are the only acceptable values for "Mutation_Status"
+ for a somatic MAF (named .somatic.maf). When Mutation_Status is None,
+ Validation_Status must be Invalid.
+
+- Centers need to make sure that Mutation_Status "None" doesn't include
+ germline mutations.
+
+- For a somatic MAF, the following rule should be satisfied:
+ SOMATIC = (A AND (B OR C OR D)) OR (E AND F)
+ A: *Mutation_Status* == "Somatic"
+ B: *Validation_Status* == "Valid"
+ C: *Verification_Status* == "Verified"
+ D: *Variant_Classification* is not {Intron, 5'UTR, 3'UTR, 5'Flank, 3'Flank,
+ IGR}, which implies that *Variant_Classification* can only be
+ {Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins,
+ Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site,
+ Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region}.
+ E: *Mutation_Status* == "None"
+ F: *Validation_Status* == "Invalid"
+
+- Extra validation rules: If Validation_Status == Valid or Invalid, then
+ Validation_Method != none (case insensitive).
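
The somatic MAF rule and the extra validation rule above can be expressed as simple predicates. The following is a minimal sketch, not the DCC validator: the function names are hypothetical, and each row is assumed to be a dictionary keyed by the MAF column headers with all relevant columns present.

```python
# Variant_Classification values excluded by condition D in the rule above.
NON_CODING_CLASSIFICATIONS = {
    "Intron", "5'UTR", "3'UTR", "5'Flank", "3'Flank", "IGR",
}


def satisfies_somatic_rule(row):
    """Evaluate SOMATIC = (A AND (B OR C OR D)) OR (E AND F) for one row."""
    a = row.get("Mutation_Status") == "Somatic"
    b = row.get("Validation_Status") == "Valid"
    c = row.get("Verification_Status") == "Verified"
    d = row.get("Variant_Classification") not in NON_CODING_CLASSIFICATIONS
    e = row.get("Mutation_Status") == "None"
    f = row.get("Validation_Status") == "Invalid"
    return (a and (b or c or d)) or (e and f)


def validation_method_ok(row):
    """Extra rule: Valid or Invalid rows must not have Validation_Method 'none'."""
    if row.get("Validation_Status") in ("Valid", "Invalid"):
        # The comparison with "none" is case insensitive, as stated above.
        return row.get("Validation_Method", "").strip().lower() != "none"
    return True


if __name__ == "__main__":
    example_row = {
        "Mutation_Status": "Somatic",
        "Validation_Status": "Untested",
        "Verification_Status": "Unknown",
        "Variant_Classification": "Missense_Mutation",
        "Validation_Method": "none",
    }
    print(satisfies_somatic_rule(example_row), validation_method_ok(example_row))
```

In this sketch an unvalidated, unverified somatic call passes only through condition D, that is, when its classification is outside the excluded set.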
+
+About MAF specifications
+========================
+
+Mutation annotation files should be transferred to the DCC. Those files should
+be formatted using the mutation annotation format (MAF) that is described below.
+The file naming convention is also described
+[below](#MutationAnnotationFormat(MAF)Specificat).
+
+The following categories of somatic mutations are reported in MAF files:
+
+- Missense and nonsense
+
+- Splice site, defined as SNP within 2 bp of the splice junction
+
+- Silent mutations
+
+- Indels that overlap the coding region or splice site of a gene or the
+ targeted region of a genetic element of interest.
+
+- Frameshift mutations
+
+- Mutations in regulatory regions
+
+### Definition of open access MAF data
+
+A large proportion of MAFs are submitted as discovery data and sites labeled as
+somatic in these files overlap with known germline variants. In order to
+minimize germline contamination in putative (unvalidated) somatic calls, certain
+filtering criteria have been imposed. Based on current policy, open access MAF
+data should:
+
+- **include** all validated somatic mutation calls
+
+- **include** all unvalidated somatic mutation calls that overlap with a
+ coding region or splice site
+
+- **exclude** all other types of mutation calls (i.e., non-somatic calls
+ (validated or not), unvalidated somatic calls that are not in coding region
+ or splice sites, and dbSNP sites that are not annotated as somatic in dbSNP,
+ COSMIC or OMIM)
+
+
+
+### Somatic MAF vs. Protected MAF
+
+Centers will submit to the DCC MAF archives that contain a Somatic MAF
+(named **.somatic.maf**) for open access data and an all-inclusive Protected MAF
+(named **.protected.maf**) that does not filter any data out and represents the
+original super-set of mutation calls. The files will be formatted using the
+Mutation Annotation Format (MAF).
+
+The following table lists some of the critical attributes of somatic and
+protected MAF files and provides a comparison.
+
+| Attribute | Somatic MAF | Protected MAF |
+| ----------- | ----------- | ------------- |
+| **File naming** | Somatic MAFs should be named as **\*.somatic.maf** and cannot contain 'germ' or 'protected' in the file name. | Protected MAFs should be named as **\*.protected.maf** and should not contain 'somatic' in the file name. |
+| **Mutation category** | Somatic MAFs can only contain entries where *Mutation_Status* is "Somatic". If any other value is assigned to the field, the archive will fail. Experimentally validated or unvalidated (see next row) somatic mutations can be included in the file. | There is no such restriction for protected MAF. The file should contain all mutation calls including those from which .somatic.maf is derived. |
+| **Filtering criteria** | In order to minimize germline contamination, somatic MAFs can contain unvalidated somatic mutations only from coding regions and splice sites, which implies: | There are no such constraints for mutations in protected MAF. |
+| | If *Validation_Status* **is** "Unknown", *Variant_Classification* **cannot** be 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, or Intron. *Variant_Classification* can only be {Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame}. | |
+| | There is no such constraint for experimentally validated (*Validation_Status* is "Valid") somatic mutations. | |
+| | | |
+| | dbSNP sites that are not annotated as somatic in dbSNP, COSMIC or OMIM must be removed from somatic MAFs. | |
+| **Access level** | These files are deployed as open access data. | These files are deployed as protected data. |
+
+MAF file fields
+===============
+
+The format of a MAF file is tab-delimited columns. Those columns are described
+in Table 1 and are required in every MAF file. The order of the columns will be
+validated by the DCC. Column headers and values **are** case sensitive where
+specified. Columns may allow null values (i.e., blank cells) and/or have
+enumerated values. **The validator looks for a header stating the version of the
+specification to validate against (e.g. \#version 2.4). If no such header is
+found, validation fails.** Any columns that come after the columns described in
+Table 1 are optional. Optional columns are not validated by the DCC and can be in any order.
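
The version header check described above can be sketched as follows. This is only an illustration under stated assumptions: the header line format (`#version 2.4.1`) is taken from the version changes listed earlier, while the function names and the regular expression are hypothetical and do not describe how the DCC validator is implemented.

```python
import re

# Matches a header line such as "#version 2.4.1".
VERSION_HEADER = re.compile(r"^#version\s+(\d+(?:\.\d+)*)\s*$")


def read_maf_version(lines):
    """Return the version string from the '#version' header line.

    Only leading comment lines are scanned. A ValueError is raised if a
    non-comment line is reached first, mirroring the rule that validation
    fails when the header is absent.
    """
    for line in lines:
        stripped = line.rstrip("\r\n")
        match = VERSION_HEADER.match(stripped)
        if match:
            return match.group(1)
        if stripped and not stripped.startswith("#"):
            break  # reached the column header row or a data row
    raise ValueError("no '#version' header found")


def split_maf_row(line):
    """Split one tab-delimited MAF line into its column values."""
    return line.rstrip("\r\n").split("\t")


if __name__ == "__main__":
    example = ["#version 2.4.1\n", "Hugo_Symbol\tEntrez_Gene_Id\tCenter\n"]
    print(read_maf_version(example))
    print(split_maf_row(example[1]))
```

Splitting on tabs only, rather than on any whitespace, keeps blank cells in place, which matters because some columns allow null values.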
+
+
+
+Table 1 - File column headers
+-----------------------------
+
+
+
+| **Index** | **MAF Column Header** | **Description of Values** | **Example** | **Case Sensitive** | **Null** | **Enumerated** |
+| --------- | --------------------- | ------------------------- | ----------- | ------------------ | -------- | -------------- |
+| 1 | Hugo_Symbol | HUGO symbol for the gene (HUGO symbols are *always* in all caps). If no gene exists within 3kb enter "Unknown". |EGFR | Yes | No | Set or Unknown | | | | | | | | | |
+| | | Source: | | | | | | | | | | | | | |
+| 2 | Entrez_Gene_Id | Entrez gene ID (an integer). If no gene exists within 3kb enter "0". | 1956 | No | No | Set | | | | | | | | | |
+| | | Source: | | | | | | | | | | | | | |
+| 3 | Center | Genome sequencing center reporting the variant. If multiple institutions report the same mutation separate list using semicolons. Non-GSC centers will be also supported if center name is an accepted center name. | hgsc.bcm.edu;genome.wustl.edu | Yes | No | Set | | | | | | | | | |
+| 4 | NCBI_Build | Any TCGA accepted genome identifier. Can be a string, integer, or float. | hg18, hg19, GRCh37, GRCh37-lite, 36, 36.1, 37 | No | No | Set and Enumerated. | | | | | | | | | |
+| 5 | Chromosome | Chromosome number without "chr" prefix that contains the gene. | X, Y, M, 1, 2, etc. | Yes | No | Set | | | | | | | | | |
+| 6 | Start_Position | Lowest numeric position of the reported variant on the genomic reference sequence. Mutation start coordinate (1-based coordinate system). | 999 | No | No | Set | | | | | | | | | |
+| 7 | End_Position | Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate (inclusive, 1-based coordinate system). | 1000 | No | No | Set | | | | | | | | | |
+| 8 | Strand | Genomic strand of the reported allele. Variants should always be reported on the positive genomic strand. (Currently, only the positive strand is an accepted value). | \+ | No | No | \+ | | | | | | | | | |
+| 9 | Variant_Classification | Translational effect of variant allele. | Missense_Mutation | Yes | No | Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR *(See Notes Section #1)* , Intron, RNA, Targeted_Region | | | | | | | | | |
+| 10 | Variant_Type | Type of mutation. TNP (tri-nucleotide polymorphism) is analogous to DNP but for 3 consecutive nucleotides. ONP (oligo-nucleotide polymorphism) is analogous to TNP but for consecutive runs of 4 or more. | INS | Yes | No | SNP, DNP, TNP, ONP, INS, DEL, or Consolidated *(See Notes Section #2)* | | | | | | | | | |
+| 11 | Reference_Allele | The plus strand reference allele at this position. Include the sequence deleted for a deletion, or "-" for an insertion. | A | Yes | No | A,C,G,T and/or - | | | | | | | | | |
+| 12 | Tumor_Seq_Allele1 | Primary data genotype. Tumor sequencing (discovery) allele 1. "-" for a deletion represents a variant. "-" for an insertion represents wild-type allele. Novel inserted sequence for insertion should not include flanking reference bases. | C | Yes | No | A,C,G,T and/or - | | | | | | | | | |
+| 13 | Tumor_Seq_Allele2 | Primary data genotype. Tumor sequencing (discovery) allele 2. "-" for a deletion represents a variant. "-" for an insertion represents wild-type allele. Novel inserted sequence for insertion should not include flanking reference bases. | G | Yes | No | A,C,G,T and/or - | | | | | | | | | |
+| 14 | dbSNP_RS | Latest dbSNP rs ID (dbSNP_ID) or "novel" if there is no dbSNP record. source: | rs12345 | Yes | Yes | Set or "novel" | | | | | | | | | |
+| 15 | dbSNP_Val_Status | dbSNP validation status. Semicolon- separated list of validation statuses. | by2Hit2Allele;byCluster | No | Yes | by1000genomes;by2Hit2Allele; byCluster; byFrequency; byHapMap; byOtherPop; bySubmitter; alternate_allele *(See Notes Section #3)* **Note that "none" will no longer be an acceptable value.** | | | | | | | | | |
+| 16 | Tumor_Sample_Barcode | BCR aliquot barcode for the tumor sample including the two additional fields indicating plate and well position. i.e. TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The full TCGA Aliquot ID. | TCGA-02-0021-01A-01D-0002-04 | Yes | No | Set | | | | | | | | | |
+| 17 | Matched_Norm_Sample_Barcode | BCR aliquot barcode for the matched normal sample including the two additional fields indicating plate and well position. i.e. TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The full TCGA Aliquot ID; e.g. TCGA-02-0021-10A-01D-0002-04 (compare portion ID '10A' normal sample, to '01A' tumor sample). | TCGA-02-0021-10A-01D-0002-04 | Yes | No | Set | | | | | | | | | |
+| 18 | Match_Norm_Seq_Allele1 | Primary data. Matched normal sequencing allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | T | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 19 | Match_Norm_Seq_Allele2 | Primary data. Matched normal sequencing allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | ACGT | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 20 | Tumor_Validation_Allele1 | Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | \- | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 21 | Tumor_Validation_Allele2 | Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | A | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 22 | Match_Norm_Validation_Allele1 | Secondary data from orthogonal technology. Matched normal genotyping (validation) for allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | C | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 23 | Match_Norm_Validation_Allele2 | Secondary data from orthogonal technology. Matched normal genotyping (validation) for allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | G | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 24 | Verification_Status *(See Notes Section #4)* | Second pass results from independent attempt using same methods as primary data source. Generally reserved for 3730 Sanger Sequencing. | Verified | Yes | Yes | Verified, Unknown | | | | | | | | | |
+| 25 | Validation_Status *(See Notes Section #5)* | Second pass results from orthogonal technology. | Valid | Yes | No | Untested, Inconclusive, Valid, Invalid | | | | | | | | | |
+| 26 | Mutation_Status | Updated to reflect validation or verification status and to be in agreement with the [VCF VLS](https://wiki.nci.nih.gov/x/2gcYAw) field. The values allowed in this field are constrained by the value in the Validation_Status field. | Somatic | Yes | No | **Validation_Status values:** Untested, Inconclusive, Valid, Invalid - **Allowed Mutation_Status values for Untested and Inconclusive:** *(See Notes Section #6)* None, Germline, Somatic, LOH, Post-transcriptional modification, Unknown - **Allowed Mutation_Status values for Valid:** *(See Notes Section #6)* Germline, Somatic, LOH, Post-transcriptional modification, Unknown - **Allowed Mutation_Status values for Invalid:** *(See Notes Section #6)* None | | | | | | | | | |
+| 27 | Sequencing_Phase | TCGA sequencing phase. Phase should change under any circumstance that the targets under consideration change. | Phase_I | No | Yes | No | | | | | | | | | |
+| 28 | Sequence_Source | Molecular assay type used to produce the analytes used for sequencing. Allowed values are a subset of the [SRA 1.5](http://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/sra/doc/SRA_1-5/) library_strategy field values. This subset matches those used at CGHub. | WGS;WXS | Yes | No | **Common TCGA values:** WGS, WGA, WXS, RNA-Seq, miRNA-Seq, Bisulfite-Seq, VALIDATION, Other **Other allowed values (per SRA 1.5):** ncRNA-Seq, WCS, CLONE, POOLCLONE, AMPLICON, CLONEEND, FINISHING, ChIP-Seq, MNase-Seq, DNase-Hypersensitivity, EST, FL-cDNA, CTS, MRE-Seq, MeDIP-Seq, MBD-Seq, Tn-Seq, FAIRE-seq, SELEX, RIP-Seq, ChIA-PET | | | | | | | | | |
+| 29 | Validation_Method | The assay platforms used for the validation call. Examples: Sanger_PCR_WGA, Sanger_PCR_gDNA, 454_PCR_WGA, 454_PCR_gDNA; separate multiple entries using semicolons. | Sanger_PCR_WGA;Sanger_PCR_gDNA | No | **No**. **If Validation_Status = Untested, then "none".** If Validation_Status = Valid or Invalid, then not "none" (case insensitive) | No | | | | | | | | | |
+| 30 | Score | Not in use. | NA | No | Yes | No | | | | | | | | | |
+| 31 | BAM_File | Not in use. | NA | No | Yes | No | | | | | | | | | |
+| 32 | Sequencer | Instrument used to produce primary data. Separate multiple entries using semicolons. | Illumina GAIIx;SOLID | Yes | No | Illumina GAIIx, Illumina HiSeq, SOLID, 454, ABI 3730xl, Ion Torrent PGM, Ion Torrent Proton, PacBio RS, Illumina MiSeq, Illumina HiSeq 2500, 454 GS FLX Titanium, AB SOLiD 4 System | | | | | | | | | |
+| 33 | Tumor_Sample_UUID | BCR aliquot UUID for tumor sample | 550e8400-e29b-41d4-a716-446655440000 | Yes | No | | | | | | | | | | |
+| 34 | Matched_Norm_Sample_UUID | BCR aliquot UUID for matched normal | 567e8487-e29b-32d4-a716-446655443246 | Yes | No |
+
+**Notes**
+*1 Intergenic Region.*
+*2 Consolidated is used to indicate a site that was initially reported as a variant but subsequently removed from further analysis because it was consolidated into a new variant. For example, a SNP variant incorporated into a TNP variant.*
+*3 Used when the discovered variant differs from that of dbSNP.*
+*4 These MAF headers describe the technology that was used to confirm a mutation, whether the same technology ("verification") or a different technology ("validation") is used to prove that a variant is germline or a somatic mutation.*
+*5 These MAF headers describe the technology that was used to confirm a mutation, whether the same technology ("verification") or a different technology ("validation") is used to prove that a variant is germline or a somatic mutation.*
+*6 Explanation of some Validation Status-Mutation Status combinations.*
+
+| Validation Status | Mutation Status | Explanation |
+| ------------------ | --------------- | ----------- |
+| Valid | Unknown | a valid variant with unknown somatic status due to lack of data from matched normal tissue. |
+| Invalid | None | validation attempted, tumor and normal are homozygous reference (formerly described as Wildtype) |
+| Inconclusive | Unknown | validation failed, neither the genotype nor its somatic status is certain due to lack of data from matched normal tissue |
+| Inconclusive | None | validation failed, tumor genotype appears to be homozygous reference |
+
+ Important Criteria
+
+ **Index column indicates the order in which the columns are expected**. **All
+ headers are case sensitive.** The Case Sensitive column specifies which values
+ are case sensitive. The Null column indicates which MAF columns are allowed to
+ have null values. The Enumerated column indicates which MAF columns have
+ specified values: an Enumerated value of "No" indicates that there are no
+ specified values for that column; other values indicate that the specific values
+ listed are allowed; a value of "Set" indicates that the MAF column values come from
+ a specified set of known values (*e.g.* HUGO gene symbols).
+
+
+MAF file checks
+===============
+
+The DCC Archive Validator checks the integrity of a MAF file. Validation will
+fail if any of the below are not true for a MAF file:
+
+1. Column header text (including case) and order must match specification
+ (Table 1) exactly
+
+2. Values under column headers listed in the specification (Table 1) as not
+ null must have values
+
+3. Values in columns that Table 1 marks as Case Sensitive must match the
+ expected case exactly.
+
+4. If column headers are listed in the specification as having *enumerated*
+ values (*i.e.* a "Yes" in the "Enumerated" column), then the values under
+ those columns must come from the enumerated values listed under "Enumerated".
+
+5. If column headers are listed in the specification as having *set* values
+ (*i.e.* a "Set" in the "Enumerated" column), then the values under those
+ columns must come from the enumerated values of that domain (*e.g.* HUGO gene
+ symbols).
+
+6. All Allele-based columns must contain "-" (deletion), or a string composed of
+ the following capitalized letters: A, T, G, C.
+
+7. If Validation_Status == "Untested", then Tumor_Validation_Allele1,
+ Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, and
+ Match_Norm_Validation_Allele2 can be null (depending on Validation_Status).
+
+ 1. If Validation_Status == "Inconclusive", then Tumor_Validation_Allele1,
+ Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, and
+ Match_Norm_Validation_Allele2 can be null (depending on Validation_Status).
+
+8. If Validation_Status == "Valid", then Validated_Tumor_Allele1 and
+ Validated_Tumor_Allele2 must be populated (one of A, C, G, T, and -)
+
+ 1. If Validation_Status == "Valid" then Tumor_Validation_Allele1,
+ Tumor_Validation_Allele2, Match_Norm_Validation_Allele1,
+ Match_Norm_Validation_Allele2 cannot be null
+
+ 2. If Validation_Status == "Invalid", then Tumor_Validation_Allele1,
+ Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, and
+ Match_Norm_Validation_Allele2 cannot be null AND
+ Tumor_Validation_Allele1 == Match_Norm_Validation_Allele1 AND
+ Tumor_Validation_Allele2 == Match_Norm_Validation_Allele2 (added as a
+ replacement for 8a as a result of breakdown)
+
+9. Check allele values against Mutation_Status and Validation_Status:
+
+ 1. If Mutation_Status == "Germline" and Validation_Status == "Valid", then
+ Tumor_Validation_Allele1 == Match_Norm_Validation_Allele1 and
+ Tumor_Validation_Allele2 == Match_Norm_Validation_Allele2.
+
+ 2. If Mutation_Status == "Somatic" and Validation_Status == "Valid", then
+ Match_Norm_Validation_Allele1 == Match_Norm_Validation_Allele2 ==
+ Reference_Allele and (Tumor_Validation_Allele1 or
+ Tumor_Validation_Allele2) != Reference_Allele
+
+ 3. If Mutation_Status == "LOH" and Validation_Status=="Valid", then
+ Tumor_Validation_Allele1 == Tumor_Validation_Allele2 and
+ Match_Norm_Validation_Allele1 != Match_Norm_Validation_Allele2 and
+ Tumor_Validation_Allele1 == (Match_Norm_Validation_Allele1 or
+ Match_Norm_Validation_Allele2).
+
+10. Check that Start_position \<= End_position
+
+11. Check for the Start_position and End_position against Variant_Type:
+
+ 1. If Variant_Type == "INS", then (End_position - Start_position + 1 ==
+ length (Reference_Allele) or End_position - Start_position == 1) and
+ length(Reference_Allele) \<= length(Tumor_Seq_Allele1 and
+ Tumor_Seq_Allele2)
+
+ 2. If Variant_Type == "DEL", then End_position - Start_position + 1 ==
+ length (Reference_Allele) and length(Reference_Allele) \>=
+ length(Tumor_Seq_Allele1 and Tumor_Seq_Allele2)
+
+ 3. If Variant_Type == "SNP", then length(Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 1 and (Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) != "-"
+
+ 4. If Variant_Type == "DNP", then length(Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 2 and (Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) !contain "-"
+
+ 5. If Variant_Type == "TNP", then length(Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 3 and (Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) !contain "-"
+
+ 6. If Variant_Type == "ONP", then length(Reference_Allele) ==
+ length(Tumor_Seq_Allele1) == length(Tumor_Seq_Allele2) \> 3 and
+ (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) !contain
+ "-"
+
+12. Validation for UUID-based files:
+
+ 1. Column \#33 must be Tumor_Sample_UUID containing UUID of the BCR aliquot
+ for tumor sample
+
+ 2. Column \#34 must be Matched_Norm_Sample_UUID containing UUID of the BCR
+ aliquot for matched normal sample
+
+ 3. Metadata represented by Tumor_Sample_Barcode and
+ Matched_Norm_Sample_Barcode should correspond to the UUIDs assigned to
+ Tumor_Sample_UUID and Matched_Norm_Sample_UUID respectively
+
+13. If Validation_Status == "Valid" or "Invalid", then Validation_Method !=
+ "none" (case insensitive).
+
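Checks 10 and 11 above can be sketched in Python roughly as follows. This is an illustrative sketch only, not the DCC Archive Validator: the function name `check_positions` and the row-as-dictionary representation are assumptions, column names are taken from Table 1, and expressions such as "length(Tumor_Seq_Allele1 and Tumor_Seq_Allele2)" are read as "the condition must hold for each allele".

```python
# Illustrative sketch of checks 10 and 11; not the DCC Archive Validator.
# check_positions and the dict-based row are hypothetical names and choices.

FIXED_LENGTH = {"SNP": 1, "DNP": 2, "TNP": 3}


def check_positions(row):
    """Return a list of problems for one MAF row given as a column -> value dict."""
    problems = []
    start = int(row["Start_Position"])
    end = int(row["End_Position"])
    ref = row["Reference_Allele"]
    tumor = [row["Tumor_Seq_Allele1"], row["Tumor_Seq_Allele2"]]
    vtype = row["Variant_Type"]
    span = end - start + 1

    # Check 10: start must not exceed end.
    if start > end:
        problems.append("Start_Position is greater than End_Position")

    # Check 11: positions and allele lengths must agree with Variant_Type.
    if vtype == "INS":
        if not (span == len(ref) or end - start == 1):
            problems.append("INS: positions do not match Reference_Allele")
        if any(len(ref) > len(allele) for allele in tumor):
            problems.append("INS: Reference_Allele is longer than a tumor allele")
    elif vtype == "DEL":
        if span != len(ref):
            problems.append("DEL: positions do not match Reference_Allele")
        if any(len(ref) < len(allele) for allele in tumor):
            problems.append("DEL: Reference_Allele is shorter than a tumor allele")
    elif vtype in FIXED_LENGTH or vtype == "ONP":
        alleles = [ref] + tumor
        lengths = {len(allele) for allele in alleles}
        if vtype == "ONP":
            length_ok = len(lengths) == 1 and len(ref) > 3
        else:
            length_ok = lengths == {FIXED_LENGTH[vtype]}
        if not length_ok:
            problems.append(vtype + ": unexpected allele length")
        if any("-" in allele for allele in alleles):
            problems.append(vtype + ": alleles must not contain '-'")
    return problems
```

An empty list means the row passed these two checks; any other result lists the reasons a validator along these lines would reject the row.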
+MAF naming convention
+=====================
+
+In archives uploaded to the DCC, the MAF file name should relate to the
+containing archive name in the following way:
+
+If the archive has the name
+
+ \_\.\.Level_2.\.\.0.tar.gz
+
+then a somatic MAF file with the archive should be named according to
+
+ \_\.\.Level_2.\[.\].somatic.maf
+
+and a protected MAF with the archive should be named according to
+
+ \_\.\.Level_2.\[.\].protected.maf
+
+The optional tag may consist of alphanumeric characters, dashes, and
+underscores, with no spaces or periods, or it may be left out altogether. The purpose
+of the optional tag is to impart some brief annotation.
+
+*Example*
+
+For the archive
+
+ genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.6.0.tar.gz
+
+the following are examples of valid MAF names
+
+ genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.somatic.maf
+ genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.protected.maf
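
The example above can be reproduced with a small helper. This is a sketch under stated assumptions: the function name `maf_name` is hypothetical, and the rule that the first number after `Level_2` is kept while the trailing two numbers and `.tar.gz` are dropped is inferred only from the single example pair shown above, not from the naming templates themselves.

```python
import re

# Assumption inferred from the example: an archive name ends in
# ".<number>.<number>.0.tar.gz" and the MAF name keeps only the first number.
_ARCHIVE_SUFFIX = re.compile(r"\.(\d+)\.(\d+)\.0\.tar\.gz$")
# Optional tag: alphanumeric characters, dash, and underscore only.
_TAG = re.compile(r"[A-Za-z0-9_-]+")


def maf_name(archive_name, kind, tag=None):
    """Build a MAF file name for an archive; kind is 'somatic' or 'protected'."""
    if kind not in ("somatic", "protected"):
        raise ValueError("kind must be 'somatic' or 'protected'")
    match = _ARCHIVE_SUFFIX.search(archive_name)
    if match is None:
        raise ValueError("archive name does not end in .<n>.<n>.0.tar.gz")
    stem = archive_name[:match.start()] + "." + match.group(1)
    if tag is not None:
        if _TAG.fullmatch(tag) is None:
            raise ValueError("tag may only contain letters, digits, '-' and '_'")
        stem += "." + tag
    return stem + "." + kind + ".maf"
```

Called with the archive name from the example and `kind="somatic"`, the helper yields the first of the two valid names listed above; passing a tag inserts it just before `.somatic.maf` or `.protected.maf`.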
diff --git a/docs/Encyclopedia/ReadyForApproval/Portion.md b/docs/Encyclopedia/ReadyForApproval/Portion.md
new file mode 100644
index 000000000..e2193c608
--- /dev/null
+++ b/docs/Encyclopedia/ReadyForApproval/Portion.md
@@ -0,0 +1,15 @@
+# Portion #
+## Description ##
+A portion is a physical piece of any sample.
+## Overview ##
+A portion is typically one of several sequential 100-120 mg sections of a vial. The [GDC Data Model](https://gdc.cancer.gov/developers/gdc-data-model/gdc-data-model-components)
+relates portions to samples and/or analytes but is not a required biospecimen entity.
+
+## References ##
+1. [GDC Data Dictionary - Portion](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=portion)
+1. [TCGA Encyclopedia - Portion](https://wiki.nci.nih.gov/display/TCGA/Portion)
+
+## External Links ##
+* N/A
+
+Categories: General, Biospecimen
diff --git a/docs/Encyclopedia/Under_Development/DAVE.md b/docs/Encyclopedia/Under_Development/DAVE.md
new file mode 100644
index 000000000..33bc88c26
--- /dev/null
+++ b/docs/Encyclopedia/Under_Development/DAVE.md
@@ -0,0 +1,16 @@
+# DAVE #
+
+## Description ##
+The NCI Genomic Data Commons's DAVE (Data Analysis, Visualization, and Exploration) tools
+are an open access interactive visualization application created to interact with the data stored in the GDC. Analysis can be performed in real time and online without downloading any of the data.
+## Overview ##
+
+
+### Tools ###
+## References ##
+1. [GDC DAVE](https://gdc.cancer.gov/dave-factsheet)
+
+## External Links ##
+* N/A
+
+Categories: Workflow Type
diff --git a/docs/Encyclopedia/Under_Development/FASTQ.md b/docs/Encyclopedia/Under_Development/FASTQ.md
deleted file mode 100644
index b6d0f3ee8..000000000
--- a/docs/Encyclopedia/Under_Development/FASTQ.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# FASTQ #
-## Description ##
-## Overview ##
-### Structure ###
-#### Header (Optional) ####
-#### Body (Optional) ####
-## References ##
-1.
-
-## External Links ##
-* TBD
-
-Categories: Data Format
diff --git a/docs/Encyclopedia/Under_Development/Node.md b/docs/Encyclopedia/Under_Development/Node.md
new file mode 100644
index 000000000..2e3c23c54
--- /dev/null
+++ b/docs/Encyclopedia/Under_Development/Node.md
@@ -0,0 +1,3 @@
+# Node #
+## Description ##
+Please see [Entity](https://docs.gdc.cancer.gov/Encyclopedia/pages/Entity/).
diff --git a/docs/Encyclopedia/Under_Development/Portion.md b/docs/Encyclopedia/Under_Development/Portion.md
index 090f7509a..83f0ba7b7 100644
--- a/docs/Encyclopedia/Under_Development/Portion.md
+++ b/docs/Encyclopedia/Under_Development/Portion.md
@@ -2,7 +2,7 @@
## Description ##
An portion is a physical sub-part of any sample..
## Overview ##
-A portion is typically one of several sequential 100-120 mg sections of a vial. The [GDC Data Model](https://gdc.cancer.gov/developers/gdc-data-model/gdc-data-model-components)
+A portion is typically one of several sequential 100-120 mg sections of a vial. The [GDC Data Model](https://gdc.cancer.gov/developers/gdc-data-model/gdc-data-model-components)
relates portions to samples and/or analytes but is not a required biospecimen entity.
## References ##
diff --git a/docs/Encyclopedia/ReadyForApproval/Aligned_Reads.md b/docs/Encyclopedia/pages/Aligned_Reads.md
similarity index 100%
rename from docs/Encyclopedia/ReadyForApproval/Aligned_Reads.md
rename to docs/Encyclopedia/pages/Aligned_Reads.md
diff --git a/docs/Encyclopedia/pages/Analyte.md b/docs/Encyclopedia/pages/Analyte.md
new file mode 100644
index 000000000..832240b01
--- /dev/null
+++ b/docs/Encyclopedia/pages/Analyte.md
@@ -0,0 +1,15 @@
+# Analyte #
+## Description ##
+An analyte is any substance or sample being analyzed.
+
+## Overview ##
+An analyte is the specimen extracted for analysis from a portion or sample using a specific extraction protocol.
+The [GDC Data Model](https://gdc.cancer.gov/developers/gdc-data-model/gdc-data-model-components) relates analytes to aliquots or portions or samples but is not a required biospecimen entity.
+
+## References ##
+1. [GDC Data Dictionary - Analyte](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=analyte)
+
+## External Links ##
+* [Analyte Wikipedia](https://en.wikipedia.org/wiki/Analyte)
+
+Categories: General, Biospecimen
diff --git a/docs/Encyclopedia/pages/Annotations.md b/docs/Encyclopedia/pages/Annotations.md
new file mode 100644
index 000000000..13b956954
--- /dev/null
+++ b/docs/Encyclopedia/pages/Annotations.md
@@ -0,0 +1,27 @@
+Annotations
+===========================
+
+Annotations contain important information about files, cases, or metadata nodes that may be of use to data downloaders when analyzing GDC data. They should be reviewed prior to running an analysis. An annotation may include key comments about why particular patients, samples, or files are absent from the GDC or why they may exhibit critical differences from others. Annotations include information that cannot be submitted to the GDC through other existing nodes or properties.
+
+Annotations are automatically downloaded in TSV format with impacted files when using the Data Transfer Tool. They may also be searched via the [API](/API/Users_Guide/Search_and_Retrieval/#annotations-endpoint) or on the annotations page of the [GDC Data Portal](https://portal.gdc.cancer.gov/annotations). Instructions on accessing annotations in the GDC Data Portal are found in the [GDC Data Portal User Guide](/Data_Portal/Users_Guide/Repository/#annotations-view).
+
+For information on annotation structure and content, please review the [GDC Data Dictionary](/Data_Dictionary/viewer/#?view=table-definition-view&id=annotation).
+
+For information about TCGA conventions for annotations please see the [TCGA Introduction to Annotations](Annotations_TCGA/).
+
+If a submitter would like to create an annotation, please contact the GDC Support Team (support@nci-gdc.datacommons.io).
+
+
+
+## References ##
+1. [API User Guide](/API/Users_Guide/Search_and_Retrieval/#annotations-endpoint)
+2. [GDC Data Portal](https://portal.gdc.cancer.gov/annotations)
+3. [GDC Data Portal User Guide](/Data_Portal/Users_Guide/Repository/#annotations-view)
+4. [GDC Data Dictionary](/Data_Dictionary/viewer/#?view=table-definition-view&id=annotation)
+5. [TCGA Annotations](Annotations_TCGA/)
+
+
+## External Links ##
+
+
+Categories: Data Type
diff --git a/docs/Encyclopedia/pages/Introduction+to+Annotations.md b/docs/Encyclopedia/pages/Annotations_TCGA.md
similarity index 96%
rename from docs/Encyclopedia/pages/Introduction+to+Annotations.md
rename to docs/Encyclopedia/pages/Annotations_TCGA.md
index 0c8379427..81c5b8a00 100644
--- a/docs/Encyclopedia/pages/Introduction+to+Annotations.md
+++ b/docs/Encyclopedia/pages/Annotations_TCGA.md
@@ -1,7 +1,8 @@
Introduction to Annotations
===========================
-This document is retained here for reference purposes and should not be considered the current standard.
-Document was adapted from https://wiki.nci.nih.gov/pages/viewpage.action?spaceKey=TCGA&title=Introduction+to+Annotations
+This document is retained for reference purposes for TCGA and should not be considered the current GDC standard. For information on the existing GDC use of annotations please see the [Annotations Encyclopedia entry](/Encyclopedia/pages/Annotations/).
+
+This document was adapted from https://wiki.nci.nih.gov/pages/viewpage.action?spaceKey=TCGA&title=Introduction+to+Annotations
This section includes the following topics.
diff --git a/docs/Encyclopedia/pages/Mutation_Annotation_Format_TCGAv2.md b/docs/Encyclopedia/pages/Mutation_Annotation_Format_TCGAv2.md
new file mode 100644
index 000000000..7e20fc7f5
--- /dev/null
+++ b/docs/Encyclopedia/pages/Mutation_Annotation_Format_TCGAv2.md
@@ -0,0 +1,373 @@
+Mutation Annotation Format (MAF) - Legacy TCGA Specification
+==============================================
+
+
+*This definition was taken from the previously public wiki hosted by TCGA and reflects the MAF format
+that was available during the active period of the TCGA project.*
+
+
+
+
+**Document Information**
+
+The spec has been reverted to the June 26th version (version 20). Additional
+changes are the removal of the "under construction" banner, changing all text to
+black, and fixing a typo in the link to the MAF 2.2 specification.
+
+**Specification for Mutation Annotation Format**
+Version 2.4.1
+June 20, 2014
+
+**Contents**
+
+- 1 Current version changes
+
+- 2 About MAF specifications
+
+ - 2.1 Definition of open access MAF data
+
+ - 2.2 Somatic MAF vs. Protected MAF
+
+- 3 MAF file fields
+
+ - 3.1 Table 1 - File column headers
+
+- 4 MAF file checks
+
+- 5 MAF naming convention
+
+- 6 Previous specification versions
+
+Current version changes
+=======================
+
+This current revision is **version 2.4.1** of the Mutation Annotation Format
+(MAF) specification.
+
+The following items in the specification were added or modified in version 2.4.1
+from version 2.4:
+
+- Header for MAF file is "\#version 2.4.1"
+
+- "Somatic" and "None" are the only acceptable values for "Mutation_Status"
+ for a somatic.MAF (named .somatic.maf). When Mutation_Status is None,
+ Validation_Status must be Invalid.
+
+- Centers need to make sure that Mutation_Status "None" does not include
+ germline mutations.
+
+- For a somatic MAF, the following rule should be satisfied:
+ SOMATIC = (A AND (B OR C OR D)) OR (E AND F)
+ A: *Mutation_Status* == "Somatic"
+ B: *Validation_Status* == "Valid"
+ C: *Verification_Status* == "Verified"
+ D: *Variant_Classification* is not {Intron, 5'UTR, 3'UTR, 5'Flank, 3'Flank,
+ IGR}, which implies that *Variant_Classification* can only be
+ {Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins,
+ Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site,
+ Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region}.
+ E: *Mutation_Status* == "None"
+ F: *Validation_Status* == "Invalid"
+
+- Extra validation rules: If Validation_Status == Valid or Invalid, then
+ Validation_Method != none (case insensitive).
+
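The SOMATIC rule listed above translates directly into a boolean expression. The following is a minimal sketch, assuming each MAF row is available as a dictionary keyed by the Table 1 column headers; the function name `allowed_in_somatic_maf` is hypothetical and not part of any DCC tool.

```python
# Sketch of SOMATIC = (A AND (B OR C OR D)) OR (E AND F).
# allowed_in_somatic_maf and the dict-based row are hypothetical.

EXCLUDED_CLASSIFICATIONS = {"Intron", "5'UTR", "3'UTR", "5'Flank", "3'Flank", "IGR"}


def allowed_in_somatic_maf(row):
    """Return True if a row satisfies the somatic MAF rule from version 2.4.1."""
    a = row["Mutation_Status"] == "Somatic"
    b = row["Validation_Status"] == "Valid"
    c = row.get("Verification_Status") == "Verified"  # column may be null
    d = row["Variant_Classification"] not in EXCLUDED_CLASSIFICATIONS
    e = row["Mutation_Status"] == "None"
    f = row["Validation_Status"] == "Invalid"
    return (a and (b or c or d)) or (e and f)
```

Under this reading, an unvalidated and unverified somatic call in an intron is rejected, while the same call with Validation_Status "Valid" is accepted.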
+About MAF specifications
+========================
+
+Mutation annotation files should be transferred to the DCC. Those files should
+be formatted using the mutation annotation format (MAF) that is described below.
+The file naming convention is also described below, under "MAF naming convention".
+
+The following categories of somatic mutations are reported in MAF files:
+
+- Missense and nonsense
+
+- Splice site, defined as SNP within 2 bp of the splice junction
+
+- Silent mutations
+
+- Indels that overlap the coding region or splice site of a gene or the
+ targeted region of a genetic element of interest.
+
+- Frameshift mutations
+
+- Mutations in regulatory regions
+
+### Definition of open access MAF data
+
+A large proportion of MAFs are submitted as discovery data and sites labeled as
+somatic in these files overlap with known germline variants. In order to
+minimize germline contamination in putative (unvalidated) somatic calls, certain
+filtering criteria have been imposed. Based on current policy, open access MAF
+data should:
+
+- **include** all validated somatic mutation calls
+
+- **include** all unvalidated somatic mutation calls that overlap with a
+ coding region or splice site
+
+- **exclude** all other types of mutation calls (i.e., non-somatic calls
+ (validated or not), unvalidated somatic calls that are not in coding region
+ or splice sites, and dbSNP sites that are not annotated as somatic in dbSNP,
+ COSMIC or OMIM)
+
+
+
+### Somatic MAF vs. Protected MAF
+
+Centers will submit to the DCC MAF archives that contain a Somatic MAF
+(named **.somatic.maf**) for open access data and an all-inclusive Protected MAF
+(named **.protected.maf**) that does not filter any data out and represents the
+original super-set of mutation calls. The files will be formatted using the
+Mutation Annotation Format (MAF).
+
+The following table lists some of the critical attributes of somatic and
+protected MAF files and provides a comparison.
+
+| Attribute | Somatic MAF | Protected MAF |
+| ----------- | ----------- | ------------- |
+| **File naming** | Somatic MAFs should be named as **\*.somatic.maf** and cannot contain 'germ' or 'protected' in the file name. | Protected MAFs should be named as **\*.protected.maf** and should not contain 'somatic' in the file name. |
+| **Mutation category** | Somatic MAFs can only contain entries where *Mutation_Status* is "Somatic". If any other value is assigned to the field, the archive will fail. Experimentally validated or unvalidated (see next row) somatic mutations can be included in the file. | There is no such restriction for protected MAFs. The file should contain all mutation calls including those from which .somatic.maf is derived. |
+| **Filtering criteria** | In order to minimize germline contamination, somatic MAFs can contain unvalidated somatic mutations only from coding regions and splice sites, which implies: | There are no such constraints for mutations in protected MAFs. |
+| | If *Validation_Status* **is** "Unknown", *Variant_Classification* **cannot** be 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, or Intron. *Variant_Classification* can only be {Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame}. | |
+| | There is no such constraint for experimentally validated (*Validation_Status* is "Valid") somatic mutations. | |
+| | | |
+| | dbSNP sites that are not annotated as somatic in dbSNP, COSMIC or OMIM must be removed from somatic MAFs. | |
+| **Access level** | These files are deployed as open access data. | These files are deployed as protected data. |
+
+MAF file fields
+===============
+
+The format of a MAF file is tab-delimited columns. Those columns are described
+in Table 1 and are required in every MAF file. The order of the columns will be
+validated by the DCC. Column headers and values **are** case sensitive where
+specified. Columns may allow null values (i.e., blank cells) and/or have
+enumerated values. **The validator looks for a header stating the version of the
+specification to validate against (e.g. \#version 2.4). If no such header is
+found, validation fails.** Any columns that come after the columns described in Table 1 are
+optional. Optional columns are not validated by the DCC and can be in any order.
+
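The layout described above (a version header, then tab-delimited columns in a fixed order, then optional extra columns) can be read with Python's standard `csv` module. This is a minimal sketch, not the DCC validator: the function name `read_maf` is hypothetical, only the first five of the 34 required headers are spelled out for brevity, and placing the `#version` line first in the file is an assumption.

```python
import csv

# First five required headers from Table 1; a real check would list all 34.
REQUIRED_PREFIX = ["Hugo_Symbol", "Entrez_Gene_Id", "Center", "NCBI_Build", "Chromosome"]


def read_maf(text, required=REQUIRED_PREFIX):
    """Parse MAF text into (version, rows); rows are column -> value dicts."""
    lines = text.splitlines()
    # Assumption: the version header is the first line of the file.
    if not lines or not lines[0].startswith("#version "):
        raise ValueError("missing '#version' header line")
    version = lines[0][len("#version "):].strip()

    # MAF values are not quoted, so quote handling is switched off.
    reader = csv.reader(lines[1:], delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    # Required columns must come first and in order; later columns are optional.
    if header is None or header[:len(required)] != required:
        raise ValueError("column headers do not match the expected order")
    rows = [dict(zip(header, record)) for record in reader]
    return version, rows
```

Because only the leading headers are compared, any additional columns after the required ones are accepted in any order, matching the statement that optional columns are not validated.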
+
+
+Table 1 - File column headers
+-----------------------------
+
+
+
+| **Index** | **MAF Column Header** | **Description of Values** | **Example** | **Case Sensitive** | **Null** | **Enumerated** |
+| --------- | --------------------- | ------------------------- | ----------- | ------------------ | -------- | -------------- |
+| 1 | Hugo_Symbol | HUGO symbol for the gene (HUGO symbols are *always* in all caps). If no gene exists within 3kb enter "Unknown". |EGFR | Yes | No | Set or Unknown | | | | | | | | | |
+| | | Source: | | | | | | | | | | | | | |
+| 2 | Entrez_Gene_Id | Entrez gene ID (an integer). If no gene exists within 3kb enter "0". | 1956 | No | No | Set | | | | | | | | | |
+| | | Source: | | | | | | | | | | | | | |
+| 3 | Center | Genome sequencing center reporting the variant. If multiple institutions report the same mutation separate list using semicolons. Non-GSC centers will be also supported if center name is an accepted center name. | hgsc.bcm.edu;genome.wustl.edu | Yes | No | Set | | | | | | | | | |
+| 4 | NCBI_Build | Any TCGA accepted genome identifier. Can be string, integer or a float. | hg18, hg19, GRCh37, GRCh37-lite, 36, 36.1, 37 | No | No | Set and Enumerated. |
+| 5 | Chromosome | Chromosome number without "chr" prefix that contains the gene. | X, Y, M, 1, 2, etc. | Yes | No | Set | | | | | | | | | |
+| 6 | Start_Position | Lowest numeric position of the reported variant on the genomic reference sequence. Mutation start coordinate (1-based coordinate system). | 999 | No | No | Set | | | | | | | | | |
+| 7 | End_Position | Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate (inclusive, 1-based coordinate system). | 1000 | No | No | Set | | | | | | | | | |
+| 8 | Strand | Genomic strand of the reported allele. Variants should always be reported on the positive genomic strand. (Currently, only the positive strand is an accepted value). | \+ | No | No | \+ | | | | | | | | | |
+| 9 | Variant_Classification | Translational effect of variant allele. | Missense_Mutation | Yes | No | Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR *(See Notes Section #1)* , Intron, RNA, Targeted_Region | | | | | | | | | |
+| 10 | Variant_Type | Type of mutation. TNP (tri-nucleotide polymorphism) is analogous to DNP but for 3 consecutive nucleotides. ONP (oligo-nucleotide polymorphism) is analogous to TNP but for consecutive runs of 4 or more. | INS | Yes | No | SNP, DNP, TNP, ONP, INS, DEL, or Consolidated *(See Notes Section #2)* |
+| 11 | Reference_Allele | The plus strand reference allele at this position. Include the sequence deleted for a deletion, or "-" for an insertion. | A | Yes | No | A,C,G,T and/or - |
+| 12 | Tumor_Seq_Allele1 | Primary data genotype. Tumor sequencing (discovery) allele 1. "-" for a deletion represents a variant. "-" for an insertion represents wild-type allele. Novel inserted sequence for insertion should not include flanking reference bases. | C | Yes | No | A,C,G,T and/or - |
+| 13 | Tumor_Seq_Allele2 | Primary data genotype. Tumor sequencing (discovery) allele 2. "-" for a deletion represents a variant. "-" for an insertion represents wild-type allele. Novel inserted sequence for insertion should not include flanking reference bases. | G | Yes | No | A,C,G,T and/or - |
+| 13 | Tumor_Seq_Allele2 | Primary data genotype. Tumor sequencing (discovery) allele 2. " -" for a deletion represents a variant. "-" for an insertion represents wild-type allele. Novel inserted sequence for insertion should not include flanking reference bases. | G | Yes | No | A,C,G,T and/or - | | | | | | | | | |
+| 14 | dbSNP_RS | Latest dbSNP rs ID (dbSNP_ID) or "novel" if there is no dbSNP record. source: | rs12345 | Yes | Yes | Set or "novel" | | | | | | | | | |
+| 15 | dbSNP_Val_Status | dbSNP validation status. Semicolon- separated list of validation statuses. | by2Hit2Allele;byCluster | No | Yes | by1000genomes;by2Hit2Allele; byCluster; byFrequency; byHapMap; byOtherPop; bySubmitter; alternate_allele *(See Notes Section #3)* **Note that "none" will no longer be an acceptable value.** | | | | | | | | | |
+| 16 | Tumor_Sample_Barcode | BCR aliquot barcode for the tumor sample including the two additional fields indicating plate and well position. i.e. TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The full TCGA Aliquot ID. | TCGA-02-0021-01A-01D-0002-04 | Yes | No | Set | | | | | | | | | |
+| 17 | Matched_Norm_Sample_Barcode | BCR aliquot barcode for the matched normal sample including the two additional fields indicating plate and well position. i.e. TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The full TCGA Aliquot ID; e.g. TCGA-02-0021-10A-01D-0002-04 (compare portion ID '10A' normal sample, to '01A' tumor sample). | TCGA-02-0021-10A-01D-0002-04 | Yes | No | Set | | | | | | | | | |
+| 18 | Match_Norm_Seq_Allele1 | Primary data. Matched normal sequencing allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | T | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 19 | Match_Norm_Seq_Allele2 | Primary data. Matched normal sequencing allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | ACGT | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 20 | Tumor_Validation_Allele1 | Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | \- | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 21 | Tumor_Validation_Allele2 | Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | A | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 22 | Match_Norm_Validation_Allele1 | Secondary data from orthogonal technology. Matched normal genotyping (validation) for allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | C | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 23 | Match_Norm_Validation_Allele2 | Secondary data from orthogonal technology. Matched normal genotyping (validation) for allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | G | Yes | Yes | A,C,G,T and/or - | | | | | | | | | |
+| 24 | Verification_Status *(See Notes Section #4)* | Second pass results from independent attempt using same methods as primary data source. Generally reserved for 3730 Sanger Sequencing. | Verified | Yes | Yes | Verified, Unknown | | | | | | | | | |
+| 25 | Validation_Status *(See Notes Section #5)* | Second pass results from orthogonal technology. | Valid | Yes | No | Untested, Inconclusive, Valid, Invalid |
+| 26 | Mutation_Status | Updated to reflect validation or verification status and to be in agreement with the [VCF VLS](https://wiki.nci.nih.gov/x/2gcYAw) field. The values allowed in this field are constrained by the value in the Validation_Status field. | Somatic | Yes | No | **Validation_Status values:** Untested, Inconclusive, Valid, Invalid - **Allowed Mutation_Status values for Untested and Inconclusive:** *(See Notes Section #6)* None, Germline, Somatic, LOH, Post-transcriptional modification, Unknown - **Allowed Mutation_Status values for Valid:** *(See Notes Section #6)* Germline, Somatic, LOH, Post-transcriptional modification, Unknown - **Allowed Mutation_Status values for Invalid:** *(See Notes Section #6)* None |
+| 27 | Sequencing_Phase | TCGA sequencing phase. Phase should change under any circumstance that the targets under consideration change. | Phase_I | No | Yes | No | | | | | | | | | |
+| 28 | Sequence_Source | Molecular assay type used to produce the analytes used for sequencing. Allowed values are a subset of the [SRA 1.5](http://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/sra/doc/SRA_1-5/) library_strategy field values. This subset matches those used at CGHub. | WGS;WXS | Yes | No | **Common TCGA values:** WGS, WGA, WXS, RNA-Seq, miRNA-Seq, Bisulfite-Seq, VALIDATION, Other **Other allowed values (per SRA 1.5):** ncRNA-Seq, WCS, CLONE, POOLCLONE, AMPLICON, CLONEEND, FINISHING, ChIP-Seq, MNase-Seq, DNase-Hypersensitivity, EST, FL-cDNA, CTS, MRE-Seq, MeDIP-Seq, MBD-Seq, Tn-Seq, FAIRE-seq, SELEX, RIP-Seq, ChIA-PET |
+| 29 | Validation_Method | The assay platforms used for the validation call. Examples: Sanger_PCR_WGA, Sanger_PCR_gDNA, 454_PCR_WGA, 454_PCR_gDNA; separate multiple entries using semicolons. | Sanger_PCR_WGA;Sanger_PCR_gDNA | No | **No**. **If Validation_Status = Untested, then "none".** If Validation_Status = Valid or Invalid, then not "none" (case insensitive). | No |
+| 30 | Score | Not in use. | NA | No | Yes | No | | | | | | | | | |
+| 31 | BAM_File | Not in use. | NA | No | Yes | No | | | | | | | | | |
+| 32 | Sequencer | Instrument used to produce primary data. Separate multiple entries using semicolons. | Illumina GAIIx;SOLID | Yes | No | Illumina GAIIx, Illumina HiSeq, SOLID, 454, ABI 3730xl, Ion Torrent PGM, Ion Torrent Proton, PacBio RS, Illumina MiSeq, Illumina HiSeq 2500, 454 GS FLX Titanium, AB SOLiD 4 System | | | | | | | | | |
+| 33 | Tumor_Sample_UUID | BCR aliquot UUID for tumor sample | 550e8400-e29b-41d4-a716-446655440000 | Yes | No | | | | | | | | | | |
+| 34 | Matched_Norm_Sample_UUID | BCR aliquot UUID for matched normal | 567e8487-e29b-32d4-a716-446655443246 | Yes | No |
+
+**Notes**
+*1 Intergenic Region.*
+*2 Consolidated is used to indicate a site that was initially reported as a variant but subsequently removed from further analysis because it was consolidated into a new variant. For example, a SNP variant incorporated into a TNP variant.*
+*3 Used when the discovered variant differs from that of dbSNP.*
+*4 These MAF headers describe the technology that was used to confirm a mutation, whether the same technology ("verification") or a different technology ("validation") is used to prove that a variant is germline or a somatic mutation.*
+*5 These MAF headers describe the technology that was used to confirm a mutation, whether the same technology (verification) or a different technology (validation) is used to prove that a variant is germline or a somatic mutation.*
+*6 Explanation of some Validation Status-Mutation Status combinations.*
+
+| Validation Status | Mutation Status | Explanation |
+| ------------------ | --------------- | ----------- |
+| Valid | Unknown | a valid variant with unknown somatic status due to lack of data from matched normal tissue. |
+| Invalid | None | validation attempted, tumor and normal are homozygous reference (formerly described as Wildtype) |
+| Inconclusive | Unknown | validation failed, neither the genotype nor its somatic status is certain due to lack of data from matched normal tissue |
+| Inconclusive | None | validation failed, tumor genotype appears to be homozygous reference |
+
+ Important Criteria
+
+ **Index column indicates the order in which the columns are expected**. **All
+ headers are case sensitive.** The Case Sensitive column specifies which values
+ are case sensitive. The Null column indicates which MAF columns are allowed to
+ have null values. The Enumerated column indicates which MAF columns have
+ specified values: an Enumerated value of "No" indicates that there are no
+ specified values for that column; a value of "Set" indicates that the MAF
+ column values come from a specified set of known values (*e.g.* HUGO gene
+ symbols); any other entry lists the specific values allowed.
+
+
+MAF file checks
+===============
+
+The DCC Archive Validator checks the integrity of a MAF file. Validation will
+fail if any of the below are not true for a MAF file:
+
+1. Column header text (including case) and order must match specification
+ (Table 1) exactly
+
+2. Values under column headers listed in the specification (Table 1) as not
+ null must have values
+
+3. Values that are specified in Table 1 as Case Sensitive must match the case
+ given in the specification.
+
+4. If column headers are listed in the specification as having *enumerated*
+ values (*i.e.* a "Yes" in the "Enumerated" column), then the values under
+ those columns must come from the enumerated values listed under "Enumerated".
+
+5. If column headers are listed in the specification as having *set* values
+ (*i.e.* a "Set" in the "Enumerated" column), then the values under those
+ columns must come from the enumerated values of that domain (*e.g.* HUGO gene
+ symbols).
+
+6. All Allele-based columns must contain - (deletion), or a string composed of
+ the following capitalized letters: A, T, G, C.
+
+7. If Validation_Status == "Untested" then Tumor_Validation_Allele1,
+ Tumor_Validation_Allele2, Match_Norm_Validation_Allele1,
+ Match_Norm_Validation_Allele2 can be null (depending on Validation_Status).
+
+ 1. If Validation_Status == "Inconclusive" then Tumor_Validation_Allele1,
+ Tumor_Validation_Allele2, Match_Norm_Validation_Allele1,
+ Match_Norm_Validation_Allele2 can be null (depending on
+ Validation_Status).
+
+8. If Validation_Status == "Valid", then Validated_Tumor_Allele1 and
+ Validated_Tumor_Allele2 must be populated (one of A, C, G, T, and -)
+
+ 1. If Validation_Status == "Valid" then Tumor_Validation_Allele1,
+ Tumor_Validation_Allele2, Match_Norm_Validation_Allele1,
+ Match_Norm_Validation_Allele2 cannot be null
+
+ 2. If Validation_Status == "Invalid" then Tumor_Validation_Allele1,
+ Tumor_Validation_Allele2, Match_Norm_Validation_Allele1,
+ Match_Norm_Validation_Allele2 cannot be null AND
+ Tumor_Validation_Allele1 == Match_Norm_Validation_Allele1 AND
+ Tumor_Validation_Allele2 == Match_Norm_Validation_Allele2 (Added as a
+ replacement for 8a as a result of breakdown)
+
+9. Check allele values against Mutation_Status and Validation_Status:
+
+ 1. If Mutation_Status == "Germline" and Validation_Status == "Valid", then
+ Tumor_Validation_Allele1 == Match_Norm_Validation_Allele1 and
+ Tumor_Validation_Allele2 == Match_Norm_Validation_Allele2.
+
+ 2. If Mutation_Status == "Somatic" and Validation_Status == "Valid", then
+ Match_Norm_Validation_Allele1 == Match_Norm_Validation_Allele2 ==
+ Reference_Allele and (Tumor_Validation_Allele1 or
+ Tumor_Validation_Allele2) != Reference_Allele
+
+ 3. If Mutation_Status == "LOH" and Validation_Status=="Valid", then
+ Tumor_Validation_Allele1 == Tumor_Validation_Allele2 and
+ Match_Norm_Validation_Allele1 != Match_Norm_Validation_Allele2 and
+ Tumor_Validation_Allele1 == (Match_Norm_Validation_Allele1 or
+ Match_Norm_Validation_Allele2).
+
+10. Check that Start_position \<= End_position
+
+11. Check for the Start_position and End_position against Variant_Type:
+
+ 1. If Variant_Type == "INS", then (End_position - Start_position + 1 ==
+ length (Reference_Allele) or End_position - Start_position == 1) and
+ length(Reference_Allele) \<= length(Tumor_Seq_Allele1 and
+ Tumor_Seq_Allele2)
+
+ 2. If Variant_Type == "DEL", then End_position - Start_position + 1 ==
+ length (Reference_Allele) and length(Reference_Allele) \>=
+ length(Tumor_Seq_Allele1 and Tumor_Seq_Allele2)
+
+ 3. If Variant_Type == "SNP", then length(Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 1 and (Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) != "-"
+
+ 4. If Variant_Type == "DNP", then length(Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 2 and (Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) !contain "-"
+
+ 5. If Variant_Type == "TNP", then length(Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 3 and (Reference_Allele and
+ Tumor_Seq_Allele1 and Tumor_Seq_Allele2) !contain "-"
+
+ 6. If Variant_Type == "ONP", then length(Reference_Allele) ==
+ length(Tumor_Seq_Allele1) == length(Tumor_Seq_Allele2) \> 3 and
+ (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) !contain
+ "-"
+
+12. Validation for UUID-based files:
+
+ 1. Column \#33 must be Tumor_Sample_UUID containing UUID of the BCR aliquot
+ for tumor sample
+
+ 2. Column \#34 must be Matched_Norm_Sample_UUID containing UUID of the BCR
+ aliquot for matched normal sample
+
+ 3. Metadata represented by Tumor_Sample_Barcode and
+ Matched_Norm_Sample_Barcode should correspond to the UUIDs assigned to
+ Tumor_Sample_UUID and Matched_Norm_Sample_UUID respectively
+
+13. If Validation_Status == "Valid" or "Invalid", then Validation_Method !=
+ "none" (case insensitive).
+
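A few of the checks above are simple enough to sketch in code. The Python below is a minimal illustration of checks 6, 10 and 11.3 to 11.6 only; it is not the DCC Archive Validator. It assumes a MAF row has already been read into a dict keyed by column header, and the function names (`is_valid_allele`, `check_substitution`) are hypothetical.

```python
# Illustrative sketch only: names and the row-as-dict layout are assumptions.
VALID_ALLELE_CHARS = set("ATGC")
FIXED_LENGTHS = {"SNP": 1, "DNP": 2, "TNP": 3}


def is_valid_allele(allele):
    """Check 6: '-' (deletion) or a non-empty string of capital A, T, G, C."""
    return allele == "-" or (len(allele) > 0 and set(allele) <= VALID_ALLELE_CHARS)


def check_substitution(row):
    """Checks 10 and 11.3-11.6 for one MAF row; returns a list of error strings."""
    errors = []
    if int(row["Start_position"]) > int(row["End_position"]):
        errors.append("Start_position must be <= End_position")

    vtype = row["Variant_Type"]
    if vtype not in FIXED_LENGTHS and vtype != "ONP":
        return errors  # INS and DEL rules are not covered by this sketch

    alleles = [row["Reference_Allele"], row["Tumor_Seq_Allele1"], row["Tumor_Seq_Allele2"]]
    if any("-" in allele for allele in alleles):
        errors.append(vtype + " alleles must not contain '-'")

    lengths = {len(allele) for allele in alleles}
    if vtype == "ONP":
        # All three alleles share one length, and that length exceeds 3.
        length_ok = len(lengths) == 1 and max(lengths) > 3
    else:
        length_ok = lengths == {FIXED_LENGTHS[vtype]}
    if not length_ok:
        errors.append(vtype + " alleles have unexpected lengths")
    return errors
```
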
+MAF naming convention
+=====================
+
+In archives uploaded to the DCC, the MAF file name should relate to the
+containing archive name in the following way:
+
+If the archive has the name
+
+ \_\.\.Level_2.\.\.0.tar.gz
+
+then a somatic MAF file with the archive should be named according to
+
+ \_\.\.Level_2.\[.\].somatic.maf
+
+and a protected MAF with the archive should be named according to
+
+ \_\.\.Level_2.\[.\].protected.maf
+
+The \ may consist of alphanumeric characters, dash, and
+underscore; no spaces or periods; or it may be left out altogether. The purpose
+of the optional tag is to impart some brief annotation.
+
+*Example*
+
+For the archive
+
+ genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.6.0.tar.gz
+
+the following are examples of valid MAF names:
+
+ genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.somatic.maf
+ genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.protected.maf
diff --git a/docs/Encyclopedia/pages/TCGA_VCF_1.1v2.md b/docs/Encyclopedia/pages/TCGA_VCF_1.1v2.md
new file mode 100644
index 000000000..b8d17d479
--- /dev/null
+++ b/docs/Encyclopedia/pages/TCGA_VCF_1.1v2.md
@@ -0,0 +1,1351 @@
+TCGA Variant Call Format (VCF) 1.1 Specification
+================================================
+
+**Document Information**
+This document is retained here for reference purposes and should not be considered the current standard.
+
+
+**Specification for TCGA Variant Call Format (VCF)**
+Version 1.1
+
+
+Please note that VCF files are treated as **protected** data and must be
+submitted to the DCC only in **Level 2** archives.
+
+About TCGA VCF specification
+============================
+
+Variant Call Format (VCF) is a format for storing and reporting genomic sequence
+variations. VCF files are modular where the annotations and genotype information
+for a variant are separated from the call itself. As of May 2011, VCF version
+4.1 (described
+[here](http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41))
+is the most recent release. GSCs will generate sequence variation data using
+high-throughput sequencing technologies and resulting variations will be
+submitted to DCC as VCF files. TCGA has adopted VCF 4.1 with certain
+modifications to support supplemental information specific to the project.
+Subsequent sections describe the format TCGA VCF files should follow and
+validation steps that would have to be implemented at the DCC.
+
+Summary of current version changes
+==================================
+
+The following summarizes the additions and modifications in this version; the
+corresponding validation rule number is given in parentheses.
+
+**UUID compliance**: All TCGA data is currently in the process of being
+converted to be UUID-compliant. Until the conversion is complete and all centers
+are prepared to start submitting UUID-compliant data, some of the VCF files may
+adhere to UUID-based specification whereas some may still have barcodes.
+Non-UUID files will follow the specification described here but for
+UUID-compliance, VCF files should satisfy the following criteria.
+
+1. **SampleUUID** and **SampleTCGABarcode** are required tags in each
+ ##SAMPLE declaration. Please note that **SampleName** will not be a
+ required tag once submitting center has fully converted to UUIDs.
+
+ 1. Metadata represented by SampleTCGABarcode at the DCC should correspond
+ to the UUID assigned to SampleUUID.
+
+2. **Individual** is not a required tag in ##SAMPLE declaration.
+
+3. If ##**INDIVIDUAL** is declared in the header, all SampleUUIDs in the
+ header must correspond to the same participant, and the corresponding TCGA
+ barcode for that participant should be assigned to ##INDIVIDUAL.
+
+
+
+1. SampleName is a required tag in ##SAMPLE declaration. The value assigned
+ to SampleName should be a valid [aliquot
+ barcode](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/) / [UUID](https://docs.gdc.cancer.gov/Encyclopedia/pages/UUID/)
+ in the database. (#15b, #15h)
+
+2. Header declarations for INFO and FORMAT fields should match the values
+ defined in Tables 4 and 5 respectively. (#7a)
+
+3. Following FORMAT fields are required for all variant records in a VCF file:
+ (#10c)
+
+ - Genotype (**GT**)
+
+ - Read depth (**DP**)
+
+ - Reads supporting ALT (**AD** or **DP4**)
+
+ - Average base quality for reads supporting alleles (**BQ**)
+
+ - Somatic status of the variant (**SS**). SS can be 0, 1, 2, 3, 4 or 5
+ depending on whether relative to normal the variant is wildtype,
+ germline, somatic, LOH, post-transcriptional modification, or unknown
+ respectively. (#23)
+
+4. Values for INFO field **VLS** (validation status relative to non-adjacent
+ Normal) will be checked for validity. It can be 0, 1, 2, 3, 4, or 5 based on
+ whether the mutation is wildtype, germline, somatic, LOH, post
+ transcriptional modification, or unknown respectively. (#9c)
+
+5. Validation of tags in PEDIGREE declaration has changed as follows: (#16)
+
+ - Name_0, Name_1, etc. do not have to be these literal strings but instead
+ represent arbitrary strings.
+
+ - The keys and values used in the should be unique
+ across assignments in any given PEDIGREE declaration.
+
+ - Value assigned in does not have to be defined as a
+ SAMPLE in a genotype column or in the header.
+
+TCGA-specific customizations
+============================
+
+The VCF 4.1 specification has been customized to support TCGA-specific variant
+information. While the majority of the steps pertaining to the basic structure of
+the file remain the same, checks for supplemental information fields have been
+introduced. For example, the TCGA VCF specification allows for additional fields to
+represent data associated with complex rearrangements, RNA-Seq variants, and
+sample-specific metadata.
+
+All TCGA-specific additions and modifications in [validation
+steps](#validation-rules) are prefixed with a
+ tag for convenient comparison with 1000Genomes VCF 4.1. The
+following table summarizes TCGA-specific customizations that have been added to
+the VCF 4.1 specification. The first column, "Customization type", indicates
+whether a new validation step has been introduced or if an existing step has
+been modified.
+
+**Table 1: TCGA-specific validation steps**
+
+| **Customization type** | **Description** | **Validation step # in TCGA-VCF 1.1 spec** | **Corresponding validation step # in VCF 4.1 spec** |
+|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------|------------------------------------------------------|
+| New | Validate that file contains ##tcgaversion HEADER line. Its presence indicates that the file is TCGA VCF and the value assigned to the field contains format version number | \--- | \--- |
+| New | Additional mandatory header lines (Please refer to [Table 2](#TCGAVariantCallFormat(VCF)1.1Specificat)) | \#1 | \#1 |
+| New | Validation of SAMPLE meta-information lines | \#15 | \--- |
+| New | Validation of PEDIGREE meta-information lines | \#16 | \--- |
+| Modification | Acceptable value set for CHROM has been modified | \#18a,b | \#16a |
+| Modification | Acceptable value set for ALT has been modified | \#19 | \#17 |
+| New | Validation for INFO sub-field "VT" has been added | \#22 | \--- |
+| New | Validation for FORMAT sub-field "SS" has been added | \#23 | \--- |
+| New | Validation for INFO/FORMAT sub-field "DP" has been added | \#24 | \--- |
+| New | Validation for complex rearrangement records has been added | \#25 | \--- |
+| New | Validation for RNA-Seq annotation fields has been added | \#26 | \--- |
+| New | Mandatory FORMAT fields have been added | \#10c | \--- |
+| New | Check for consistent definitions for INFO and FORMAT fields | \#7a | \--- |
+
+File format
+===========
+
+The following example (based on [VCF version
+4.1](http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41))
+shows different components of a TCGA VCF file. Any VCF file contains two main
+sections. The HEADER section contains meta-information for variant records that
+are reported as individual rows in the BODY of the VCF file. Both sections are
+described below.
+
+**Case-sensitivity**: Please note that all fields and their associated
+validation rules are case-sensitive (as given in the specification) unless noted
+otherwise.
+
+**Figure 1: Components of a sample TCGA VCF file**
+
+| ![images](images/vcfExample_VCF.png) |
+|------------------------------------------|
+
+
+
+
+HEADER
+------
+
+The HEADER contains meta-information lines that provide supplemental information
+about variants contained in BODY of the file. HEADER lines could be formatted in
+the following two ways:
+
+
+ ##key=value
+
+ Example:
+
+ ##fileformat=VCFv4.1
+
+ ##fileDate=20090805
+
+
+or
+
+ ##FIELDTYPE=
+
+ Example:
+
+ ##INFO=
+
+Meta-information could be applicable either to all variant records in the file
+(e.g., date of creation of file) or to individual variants (e.g., flag to
+indicate whether a given variant exists in dbSNP).
+
+### Generic meta-information
+
+**Format**: *##key=value* OR *##FIELDTYPE=*
+
+The following table lists some of the reserved field names. Files can be
+customized to contain additional meta-information fields as long as they are not
+in conflict with reserved field names. The first field in Table 2 (fileformat)
+is mandatory and lists the VCF version number of the file.
+
+**Table 2: Examples of generic meta-information fields**
+
+| **Field** | **Case-Sensitive** | **Description** | **Sample values** | **Required (fields in red are TCGA-specific requirements)** |
+| ------------- | --------------------- | --------------- | ------------------ | ----------------------------------------------------- |
+| Fileformat | No | Lists the VCF version number the file is based on; must be the first line in the file | ##fileformat=VCFv4.1 | Yes |
+| fileDate | No | Date file was created; should be in yyyymmdd format | ##fileDate=20090805 | Yes |
+| Tcgaversion | No | Indicates that the file follows TCGA-VCF specification. Format version number is assigned to the field. | ##tcgaversion=1.1 | Yes |
+| Reference | No | Reference build used for variant calling and against which variant coordinates are shown | ##reference=1000GenomesPilot-NCBI36 | Yes |
+| | | | | |
+| | | | OR | |
+| | | | | |
+| | | | ##reference= | |
+| Assembly | No | External assembly file. The field can be assigned a file name if assembly file is included in the archive submitted to the DCC or it can be a URL pointing to the file location. | ##assembly=[ftp://ftp-trace.ncbi.nih.gov/](ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta) | Yes |
+| | | | [1000genomes/ftp/release/sv/breakpoint_assemblies.fasta](ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta) | (if a contig from an assembly file is being referred to in the VCF file, especially for breakends) |
+| center | No | Name of the center where VCF file is generated. A comma-separated list can be provided if files from multiple centers are merged. | ##center="Broad" | Yes |
+| | | | | |
+| | | | OR | |
+| | | | | |
+| | | | ##center="Broad,UCSC,BCM" | |
+| phasing | No | Indicates whether genotype calls are partially phased (phasing=partial) or unphased (phasing=none) | ##phasing=none | Yes |
+| geneAnno | No | URL of the gene annotation source e.g., Generic Annotation File (GAF) | ##geneAnno=[https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files](https://api.gdc.cancer.gov/legacy/data/95c3618c-bd9e-4df4-96e4-ef8d54710e51) | Yes (if annotation tags like GENE, SID and RGN are used) |
+| vcfProcessLog | No | Lists algorithm, version and settings used to generate variant calls in a VCF file. If multiple VCF files are processed to produce a single merged file, the field records attributes for individual VCF files and the programs used to merge the files along with the associated version, parameters and contact information of the person who produced the merged file. | | |
+ | | | | ##vcfProcessLog=, InputVCFSource=, | |
+| | | **Note**: If VCF file does not represent a set of merged files, *MergeSoftware*, *MergeParam*, *MergeVer* and *MergeContact* tags will not be applicable and can be omitted. | InputVCFVer=<1.0>, | |
+| | | | InputVCFParam= | |
+| | | **Note**: If multiple parameters need to be declared in *InputVCFParam*, key=value pairs can be used to name these parameters. For example: | InputVCFgeneAnno=> | |
+| | | InputVCFParam= | | |
+| | | If there are multiple files for which parameters have to be declared, following format can be used: | OR | |
+| | | InputVCFParam= | | |
+| | | | ##vcfProcessLog=, | |
+| | | | InputVCFSource=, | |
+| | | | InputVCFVer=<1.0,2.1,2.0>, | |
+| | | | InputVCFParam=, | |
+| | | | InputVCFgeneAnno=, | |
+| | | | MergeSoftware=, | |
+| | | | MergeParam=, | |
+| | | | MergeVer=<2.1,3.0>, | |
+| | | | MergeContact=> | |
+| INDIVIDUAL | No | Specifies the individual for which data is presented in the file | ##INDIVIDUAL=TCGA-24-0980 | No |
+
+### INFO/FORMAT/FILTER meta-information
+
+**Format**: *##FIELDTYPE=*
+
+INFO, FORMAT and FILTER (case-sensitive values) are optional fields that have to
+be declared in the HEADER if they are being referred to in BODY of the file.
+Different *keys* that can be used to define them are described in Table 3. All
+three fields do not use the same set of keys. Please refer to individual field
+definitions for further details.
+
+**Important**: TCGA VCF 1.1 requires all VCF files to follow consistent header
+declarations for standard INFO and FORMAT sub-fields. Please refer to Tables 4
+and 5 for details. If a sub-field exists in these tables and is used in a TCGA
+VCF file, then all pairs in the definition should match entries in
+the corresponding table for the file to pass validation.
+
+**Table 3: Description of keys used in INFO/FORMAT/FILTER meta-information
+declarations**
+
+| **Key** | **Case-sensitive** | **Description** | **Data Type (Possible values)** | **Additional Notes** |
+| ----------- | ------------------ | --------------- | ------------------------------- | -------------------- |
+| ID | Yes | name of the field; also used in BODY of the file to assign values for individual variant records | String, no whitespaces, no comma | \--- |
+| Number | Yes | specifies the number of values that can be associated with the corresponding field | Set | Any integer \>= 0 indicating number of values; |
+| | | | *(Integer \>= 0, "A", "G", ".")* | "A", if the field has one value per alternate allele; |
+| | | | | "G", if the field has one value per genotype; |
+| | | | | ".", if number of values varies, is unknown, or is unbounded |
+| Type | Yes | indicates data type of the value associated with the field | Set | "Flag" type indicates that the field does not contain a value entry, and hence the *Number* should be 0 in this case. FORMAT fields cannot have a "Flag" *Type* assigned to them. |
+| | | | *(Integer, Float, Flag, Character, String)* | |
+| Description | Yes | provides a brief description of the field | String, surrounded by double-quotes, cannot itself contain a double-quote, cannot contain trailing whitespace at the end of string before closing quotes | \--- |
+
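The constraints in Table 3 can be checked mechanically. The sketch below is one possible way to do so in Python, not the DCC validator itself. It assumes the VCF 4.1 declaration layout `##INFO=<ID=...,Number=...,Type=...,Description="...">` with keys in that order; the function name `validate_declaration` and the error messages are hypothetical.

```python
import re

# Allowed values for the Type key, as listed in Table 3.
ALLOWED_TYPES = {"Integer", "Float", "Flag", "Character", "String"}

# Assumed key order: ID, Number, Type, Description (the order VCF 4.1 uses).
DECLARATION = re.compile(
    r'^##(INFO|FORMAT)=<ID=([^,\s]+),Number=([^,]+),Type=([^,]+),'
    r'Description="([^"]*)">$'
)


def validate_declaration(line):
    """Return a list of problems found in one INFO or FORMAT header line."""
    match = DECLARATION.match(line)
    if match is None:
        return ["line does not match the expected declaration layout"]
    field_type, _field_id, number, value_type, description = match.groups()

    errors = []
    # Number: a non-negative integer, "A", "G", or ".".
    if number not in {"A", "G", "."} and not re.fullmatch(r"[0-9]+", number):
        errors.append("Number must be an integer >= 0, 'A', 'G' or '.'")
    if value_type not in ALLOWED_TYPES:
        errors.append("unknown Type: " + value_type)
    if value_type == "Flag":
        if field_type == "FORMAT":
            errors.append("FORMAT fields cannot have Type=Flag")
        if number != "0":
            errors.append("Type=Flag requires Number=0")
    if description != description.rstrip():
        errors.append("Description has trailing whitespace before the closing quote")
    return errors
```

A declaration such as `##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">` yields an empty list, while the same line declared as `##FORMAT` reports the Flag restriction.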
+#### INFO lines
+
+**Format**: *##INFO=*
+**Required keys**: ID, Type, Number, Description
+
+INFO fields are optional and contain additional annotations for a variant.
+Certain INFO fields have already been created and exist as reserved fields in
+the current VCF standard. Custom INFO fields can be added based on study
+requirements as long as they do not use the reserved field names. If an INFO
+field is declared in the header, it needs to be described further using the
+following format:
+
+ ##INFO=
+
+ Example:
+
+ ##INFO=
+
+ ##INFO=
+
+#### FORMAT lines
+
+**Format**: *##FORMAT=*
+**Required keys**: ID, Type, Number, Description
+
+FORMAT declaration lines are used when annotations need to be added for
+individual genotypes associated with each sample in the file. FORMAT sub-fields
+are declared precisely as the INFO sub-fields with the exception that a FORMAT
+sub-field cannot be assigned a "Flag" *Type.*
+
+ ##FORMAT=
+
+ Example:
+
+ ##FORMAT=
+
+ ##FORMAT=
+
+**Important**: TCGA VCF 1.1 requires the following FORMAT sub-fields to be
+defined for all variant records. Therefore, these FORMAT lines are not optional
+for TCGA VCF files and should be declared in the header. Please refer to Table 5
+for definitions for these sub-fields.
+
+- Genotype (**GT**)
+
+- Read depth (**DP**)
+
+- Reads supporting ALT (**AD** or **DP4**). Either AD or DP4 is required to be
+ defined although DP4 is preferred.
+
+- Average base quality for reads supporting alleles (**BQ**)
+
+- Somatic status of the variant (**SS**). SS can be 0, 1, 2, 3, 4, or 5
+ depending on whether relative to normal the variant is wildtype, germline,
+ somatic, LOH, post-transcriptional modification, or unknown respectively.
+
+These should be considered as required fields so that they are included by
+default unless there is an exceptional scenario where the information for a
+field cannot be obtained. In such a case, "." can be used to indicate missing
+value.
+
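As an illustration of the required FORMAT sub-fields and the SS codes listed above, the following Python sketch checks a record's FORMAT column and decodes an SS value. It assumes the standard colon-separated FORMAT column of a VCF record (for example `GT:DP:AD:BQ:SS`); the helper names are hypothetical and not part of any TCGA tool.

```python
# SS codes as described in this section (0-5).
SS_CODES = {
    0: "wildtype",
    1: "germline",
    2: "somatic",
    3: "LOH",
    4: "post-transcriptional modification",
    5: "unknown",
}


def missing_required_format(format_column):
    """List the required FORMAT sub-fields absent from a FORMAT column string."""
    keys = format_column.split(":")
    missing = [key for key in ("GT", "DP", "BQ", "SS") if key not in keys]
    # Either AD or DP4 satisfies the "reads supporting ALT" requirement.
    if "AD" not in keys and "DP4" not in keys:
        missing.append("AD or DP4")
    return missing


def decode_ss(value):
    """Translate an SS value into its meaning; '.' marks a missing value."""
    if value == ".":
        return None
    code = int(value)
    if code not in SS_CODES:
        raise ValueError("SS must be between 0 and 5, got " + value)
    return SS_CODES[code]
```

Treating `.` as "no information" rather than as an error mirrors the rule that a missing value is permitted only in exceptional cases.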
+#### FILTER lines
+
+**Format**: *##FILTER=*
+**Required keys**: ID, Description
+
+FILTER fields are defined to list filtering criteria used for generating variant
+calls. Custom filters can be applied as long as a definition is provided in the
+HEADER. FILTERs that have been applied to the data should be described as
+follows. Please note that FILTER declarations do not include *Type* or *Number*
+keys.
+
+ ##FILTER=
+
+ Example:
+
+ ##FILTER=
+
+ ##FILTER=
+
+#### Consistent definitions for reserved INFO and FORMAT fields
+
+To ensure that all TCGA VCF files have consistent definitions for standard
+fields and to avoid merging errors due to contradicting definitions, the following
+header declarations for common fields are proposed. The 'Source' column in the
+tables below indicates whether the field is from 1000Genomes VCF or if it is
+specific to TCGA-VCF. By adhering to these definitions, we can ensure that a
+given field is interpreted the same way across all centers and that same
+'Number', 'Type' and 'Description' values are used for these IDs.
+
+##### Table 4: INFO sub-field definitions
+
+| **Sub-field** | **Source** | **Formatted declaration** |
+|---------------|------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| AA | VCF | ##INFO= |
+| AC | VCF | ##INFO= |
+| AF | VCF | ##INFO= |
+| AN | VCF | ##INFO= |
+| BQ | VCF | ##INFO= |
+| CIGAR | VCF | ##INFO= |
+| DB | VCF | ##INFO= |
+| DP | VCF | ##INFO= |
+| END | VCF | ##INFO= |
+| H2 | VCF | ##INFO= |
+| H3 | VCF | ##INFO= |
+| MQ | VCF | ##INFO= |
+| MQ0 | VCF | ##INFO= |
+| NS | VCF | ##INFO= |
+| SB | VCF | ##INFO= |
+| SOMATIC | VCF | ##INFO= |
+| VALIDATED | VCF | ##INFO= |
+| 1000G | VCF | ##INFO= |
+| IMPRECISE | VCF | ##INFO= |
+| NOVEL | VCF | ##INFO= |
+| SVTYPE | VCF | ##INFO= |
+| SVLEN | VCF | ##INFO= |
+| CIPOS | VCF | ##INFO= |
+| CIEND | VCF | ##INFO= |
+| HOMLEN | VCF | ##INFO= |
+| HOMSEQ | VCF | ##INFO= |
+| BKPTID | VCF | ##INFO= |
+| MEINFO | VCF | ##INFO= |
+| METRANS | VCF | ##INFO= |
+| DGVID | VCF | ##INFO= |
+| DBVARID | VCF | ##INFO= |
+| DBRIPID | VCF | ##INFO= |
+| MATEID | VCF | ##INFO= |
+| PARID | VCF | ##INFO= |
+| EVENT | VCF | ##INFO= |
+| CILEN | VCF | ##INFO= |
+| DPADJ | VCF | ##INFO= |
+| CN | VCF | ##INFO= |
+| CNADJ | VCF | ##INFO= |
+| CICN | VCF | ##INFO= |
+| CICNADJ | VCF | ##INFO= |
+| VLS | TCGA-VCF | ##INFO= |
+| SID | TCGA-VCF | ##INFO= |
+| GENE | TCGA-VCF | ##INFO= |
+| RGN | TCGA-VCF | ##INFO= |
+| RE | TCGA-VCF | ##INFO= |
+| VT | TCGA-VCF | ##INFO= |
+
+##### Table 5: FORMAT sub-field definitions
+
+| **Sub-field** | **Source** | **Formatted declaration** |
+|---------------|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| GT | VCF | ##FORMAT= |
+| DP | VCF | ##FORMAT= |
+| FT | VCF | ##FORMAT= |
+| GL | VCF | ##FORMAT= |
+| PL | VCF | ##FORMAT= |
+| GP | VCF | ##FORMAT= |
+| GQ | VCF | ##FORMAT= |
+| HQ | VCF | ##FORMAT= |
+| CN | VCF | ##FORMAT= |
+| CNQ | VCF | ##FORMAT= |
+| CNL | VCF | ##FORMAT= |
+| MQ | VCF | ##FORMAT= |
+| HAP | VCF | ##FORMAT= |
+| AHAP | VCF | ##FORMAT= |
+| SS | TCGA-VCF | ##FORMAT= |
+| TE | TCGA-VCF | ##FORMAT= |
+| AD | TCGA-VCF | ##FORMAT= |
+| DP4 | TCGA-VCF | ##FORMAT= |
+| BQ | TCGA-VCF | ##FORMAT= |
+| VAQ | TCGA-VCF | ##FORMAT= |
+
+
+
+
+
+### TCGA-specific meta-information
+
+#### PEDIGREE lines
+
+**Format**: *##PEDIGREE=*
+**Required keys**: Name_0,..,Name_N where N \>= 1;
+
+PEDIGREE lines are used to specify derivation relationships between different
+genomes. *Name_0* is associated with the derived genome and *Name_1* through
+*Name_N* represent the genomes from which it is derived. In the case of tumor
+clonal populations, one population is clonally derived from another. In the
+example below, PRIMARY-TUMOR-GENOME is derived from GERMLINE-GENOME.
+
+ ##PEDIGREE=,Name_1=,...,Name_N=>
+
+ where N is \>= 1;
+
+ Example:
+
+ ##PEDIGREE=
+
+#### SAMPLE lines
+
+**Format**: *##SAMPLE=*
+**Required keys**: ID, SampleName, Individual, File, Platform, Source, Accession
+
+For UUID-compliant files, the following rules apply:
+
+**Required keys**: ID, SampleName, Individual, SampleUUID, SampleTCGABarcode,
+File, Platform, Source, Accession
+
+- Value assigned to "SampleUUID" should be a valid [aliquot
+ UUID](https://docs.gdc.cancer.gov/Encyclopedia/pages/UUID/) in the database.
+
+- Value assigned to "SampleTCGABarcode" should represent the aliquot-level
+ metadata associated with SampleUUID. This metadata mapping is originally
+ received by the DCC from BCR.
+
+> Example:
+
+> ##SAMPLE=,Mixture=<0.1,0.9>,Genome_Description=<"Germline
+> contamination","Tumor genome">>
+
+
+
+SAMPLE lines are used to include additional metadata about each sample for which
+data is represented in the VCF file. All samples are listed in the column header
+line following the FORMAT column (Figure 1). Each of these samples should have
+its own HEADER declaration where the sample identifier in the column header
+should be the same as the value assigned to "ID" key in the corresponding
+declaration. Value assigned to "SampleName" should be a valid [aliquot
+barcode](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/) / [UUID](https://docs.gdc.cancer.gov/Encyclopedia/pages/UUID/)
+in the database. The declaration lists information about the sample (source,
+platform, source file, etc.) and can also be used to indicate if the sample is a
+mixture of different kinds of genomes. In the example below, the "Genomes", "Mixture"
+and "Genome_Description" tags represent a comma-separated list of the different
+genomes that a sample contains, the proportion of each genome in the sample, and a
+brief description of each genome, respectively.
+
+##SAMPLE=,Mixture=
+,Genome_Description=<"S1","S2",..,"SK">>
+
+Example:
+
+##SAMPLE=,Mixture=<0.1,0.9>,Genome_Description=<"Germline
+contamination","Tumor genome">>
+
+- "Description" field for genome mixture has been renamed to
+ "Genome_Description" to distinguish it from sample description.
+
+- Values for tags related to genome mixture (Genomes, Mixture,
+ Genome_Description) are within angle brackets.
+
+### Column header meta-information
+
+**Format**: Tab-delimited line starting with "#" and containing headers for all
+columns in the BODY as shown below.
+
+This is a mandatory header line where the first 8 fields are fixed and have to
+be defined in the column header. Fields from "FORMAT" onwards are optional and are
+included to encapsulate per-sample/genome genotype data.
+
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ...
+
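The column header rule can be sketched as follows. This is a minimal Python illustration under the assumption that the header line is tab-delimited as described; `parse_column_header` is a hypothetical helper name, and the sample identifiers in any call are made up for the example.

```python
# The first 8 column names are fixed by the specification.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_column_header(line):
    """Validate the column header line and return the sample IDs after FORMAT."""
    fields = line.rstrip("\n").split("\t")
    if fields[:8] != FIXED_COLUMNS:
        raise ValueError("first 8 columns must be: " + " ".join(FIXED_COLUMNS))
    if len(fields) == 8:
        return []  # no genotype data, so no FORMAT or sample columns
    if fields[8] != "FORMAT":
        raise ValueError("column 9, when present, must be FORMAT")
    # Everything after FORMAT is a sample identifier that should match the
    # ID key of a ##SAMPLE declaration in the HEADER.
    return fields[9:]
```
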
+BODY
+----
+
+### Variant records
+
+Data lines are tab-delimited and list information about individual variants and
+associated genotypes across samples. The first 8 fields (Figure 1) are required
+to be listed in the VCF column header line. Some of these fields require
+non-null values (see Table 6) for each record. For the remaining fixed fields,
+even if the field does not have an associated value, it still needs to be
+specified with a missing value identifier ("." in VCF 4.1). Subsequent fields
+are optional.
+
+**Table 6: Description of fields in the BODY of a VCF file**
+
+| **Index** | **Field** | **Case-sensitive** | **Description** | **Data type** | **Sample values** | **Required\*** | **Additional notes** |
+| --------- | ---------- | ------------------ | --------------- | ------------- | --------------------- | -------------- | -------------------- |
+| 1 | CHROM | Yes | *Chromosome*: an identifier from the reference genome or the assembly file defined in the HEADER. | Alphanumeric string | 20 | Yes | Chromosome name should not contain "chr" prefix, e.g., "chr10" will be an invalid entry |
+| | | | | *([1-22], X, Y, MT, )* | | | |
+| 2 | POS | Yes | *Position*: The reference position, with the 1st base having position 1. | Non-negative integer | 1110696 | Yes | \--- |
+| 3 | ID | Yes | *Identifier*: Semi-colon separated list of unique identifiers if available. | String, no white-space or semi-colons | rs6054257_66370 | No | **Important**: When using an rsID as the variant identifier, please append chromosomal location of the variant to the ID. For example, if the variant is at chr7:6013153 and the corresponding rsID is rs10000, then the variant ID should be rs10000_6013153. This is to ensure that there is a consistent rule for satisfying the condition for unique IDs even if a file contains single rsID that maps to multiple variants. |
+| 4 | REF | Yes | *Reference allele(s)*: Reference allele at the position. | String | GTCT | Yes | Value in POS field refers to the position of the first base in the REF string. |
+| | | | | *([ACGTN]+* ) | | | |
+| 5 | ALT | Yes | *Alternate allele(s)*: Comma separated list of alternate non-reference alleles called on at least one of the samples. Angle-bracketed ID String (\) can also be used for symbolically representing alternate alleles. | String; no whitespace, commas, or angle-brackets in the ID string | G,GTCT | No | if ALT==, ID needs to be defined in the header as |
+| | | | | *([ACGTN]+, , .)* | . | | ##ALT= |
+| | | | | | | | |
+| 6 | QUAL | Yes | *Quality score*: Phred-scaled quality score for the assertion made in ALT. | Integer \>= 0 | 50 | No | Scores should be non-negative integers or missing values |
+| 7 | FILTER | Yes | *Filtering results*: PASS if this position has passed all filters. Otherwise, a semicolon-separated list of codes for the filters that failed. | String, no whitespace or semi-colon | PASS | No | "0" is reserved and cannot be used as a filter String. |
+| | | | | | q10;s50 | | |
+| 8 | INFO | Yes | *Additional information*: INFO fields are encoded as a semicolon-separated series of keys (same as ID in an INFO declaration) with optional values in the format **. | String, no whitespace, semi-colons, or equal-signs | NS=3;DP=14; | No | \--- |
+| 9 | FORMAT | Yes | *Genotype sub-fields*: If genotype data is present in the file, the fixed fields are followed by a FORMAT column. The field contains a colon-separated list of all pre-defined FORMAT sub-fields (same as ID in a FORMAT declaration) that are applicable to all samples that follow. | String, no whitespace, sub-fields cannot contain colon | GT:GQ:DP:HQ | No | "GT" must be the first sub-field if it is present in the FORMAT field. |
+| 10 | | Case should be same as in "ID" tag of \#\#SAMPLE declaration in the header | *Per-sample genotype information*: An arbitrary number of sample IDs can be added to the column header line and a variant record in the BODY can contain genotype information corresponding to FORMAT column for each sample. Contains a colon-separated list of values assigned to each of the sub-fields in FORMAT column. | String, no whitespace, sub-fields cannot contain colon | 0\|0:48:1:51,51 | No | Values are assigned to FORMAT sub-fields in the SAME order as specified in "FORMAT" column. All samples in any given row for a variant record MUST contain values for all sub-fields as defined in "FORMAT" column. If any of the fields does not have an associated value, then missing value identifier (".") should be used for that field. However, "." cannot be used as a value for any of the IDs in the FORMAT field (e.g., GT:.:DP would lead to an error). |
+
+\* A "Required" field cannot contain the missing value identifier for any record
+listed in the data lines.
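+
+The rule above can be sketched in Python. This is a minimal illustration, not
+part of any official validator: the function names `parse_fixed_fields` and
+`missing_required` are hypothetical, and the set of required fields is taken
+from the "Required" column of Table 6 (CHROM, POS, REF).
+
```python
MISSING = "."
FIXED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
# Fields marked "Yes" in the Required column of Table 6 (assumption of this sketch).
REQUIRED_FIELDS = {"CHROM", "POS", "REF"}


def parse_fixed_fields(data_line):
    """Split a tab-delimited data line and return the 8 fixed fields as a dict."""
    columns = data_line.rstrip("\n").split("\t")
    if len(columns) < len(FIXED_FIELDS):
        raise ValueError("a data line needs at least 8 tab-delimited fields")
    # zip() stops after the 8 fixed fields; FORMAT and sample columns are ignored here.
    return dict(zip(FIXED_FIELDS, columns))


def missing_required(record):
    """Return the required fields that hold the missing value identifier."""
    return sorted(name for name in REQUIRED_FIELDS if record[name] in (MISSING, ""))


record = parse_fixed_fields("20\t14370\trs6054257_14370\tG\tA\t29\tPASS\tNS=3;DP=14")
print(missing_required(record))
```
+
+Optional fixed fields such as ID or QUAL may hold "." without being reported;
+only the required fields are checked.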
+
+Extensions for TCGA data
+========================
+
+TCGA data includes but is not limited to SNPs and small indels. A variant
+representation format for cancer data should be able to support more complex
+variation types such as structural variants, complex rearrangements and RNA-Seq
+variants. The following sub-sections present an overview of the extensions that
+have been added to clearly describe such variations in a VCF file.
+
+Structural variants
+-------------------
+
+A [structural variant](http://www.ncbi.nlm.nih.gov/dbvar/content/overview/) (SV)
+can be defined as a region of DNA that includes a variation in the structure of
+the chromosome. Such variations could be due to inversions and balanced
+translocations or genomic imbalances (insertions and deletions), also referred
+to as copy number variants (CNVs). Certain features have been added to the
+format in order to clearly describe structural variants in a VCF file. A
+detailed description of the extensions is available
+[here](http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41).
+
+Complex rearrangements
+----------------------
+
+Chromosomal rearrangements are caused by breakage of DNA double helices at two
+different locations. The broken ends in turn rejoin to produce a new chromosomal
+arrangement. Complex rearrangements involving more than two breaks are
+frequently observed in cancer genomes. Certain modifications need to be made to
+the VCF standard to adequately represent such variations in a VCF file. A
+detailed specification of the proposed extensions to describe rearrangements in
+a VCF file is available
+[here](http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41).
+Figure 2 illustrates some of the concepts relevant to VCF records for complex
+rearrangements.
+
+**Figure 2: Adjacencies and breakends in a chromosomal rearrangement** (adapted
+from VCF 4.1 specification)
+
+| ![media](images/ccr_VCF.png) |
+| ------------------------------------------ |
+
+
+
+A VCF file has one line for each of the two breakends in an adjacency. Table 7
+provides a list of sub-fields that have been added to describe breakends. An
+INFO sub-field (**SVTYPE=BND**) is used to indicate a breakend record.
+Sub-fields MATEID and PARID are used to represent variant record IDs of
+corresponding mates and partners respectively.
+
+**Table 7: Fields added for breakends**
+
+| **Field:Sub-field** | **Description** | **Declaration in HEADER** | **Sample value in BODY** | **Required** |
+| ------------------- | --------------- | ------------------------- | ------------------------ | ------------ |
+| INFO:**SVTYPE** | Type of structural variant; SVTYPE is set to "BND" for breakend records | ##INFO= | *SVTYPE=BND* | Yes (SVTYPE=BND for breakend records) |
+| INFO:**MATEID** | ID of corresponding mate of the breakend record | ##INFO= | *MATEID=bnd_U* | No |
+| INFO:**PARID** | ID of corresponding partner of the breakend record | ##INFO= | *PARID=bnd_V* | No |
+| INFO:**EVENT** | ID of event associated to breakend | ##INFO= | *EVENT=RR0* | No |
+
+The specification for the ALT field deviates from the standard format for
+breakend records. The ALT field for a breakend record can be represented in four
+possible ways, depending on the type of replacement.
+
+| **REF** | **ALT** | **Description** |
+| ------- | ------- | --------------- |
+| s | t[p[ | piece extending to the right of p is joined after t |
+| s | t]p] | reverse comp piece extending left of p is joined after t |
+| s | ]p]t | piece extending to the left of p is joined before t |
+| s | [p[t | reverse comp piece extending right of p is joined before t |
+
+Legend:
+
+s: sequence of REF bases beginning at position POS
+
+t: sequence of bases that replaces "s"
+
+p: position of the breakend mate indicating the first mapped base that joins at
+the adjacency; represented as a string of the form "chr:pos"
+
+[]: square brackets indicate direction that the joined sequence continues in,
+starting from p
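+
+The four ALT forms can be told apart with two regular expressions. The Python
+sketch below is only an illustration under simplifying assumptions: the
+function name `parse_breakend_alt` is hypothetical, "t" is restricted to the
+characters A, C, G, T and N, and "p" must have the plain "chr:pos" form.
+
```python
import re

# t[p[ and t]p]: the mate piece is joined after t.
AFTER_T = re.compile(
    r"^(?P<t>[ACGTN]+)(?P<bracket>[\[\]])(?P<chrom>[^\[\]:]+):(?P<pos>\d+)(?P=bracket)$"
)
# ]p]t and [p[t: the mate piece is joined before t.
BEFORE_T = re.compile(
    r"^(?P<bracket>[\[\]])(?P<chrom>[^\[\]:]+):(?P<pos>\d+)(?P=bracket)(?P<t>[ACGTN]+)$"
)


def parse_breakend_alt(alt):
    """Return the parts of a breakend ALT string, or None if it is not one."""
    for pattern, joined in ((AFTER_T, "after"), (BEFORE_T, "before")):
        match = pattern.match(alt)
        if match is None:
            continue
        bracket = match.group("bracket")
        return {
            "t": match.group("t"),
            "mate_chrom": match.group("chrom"),
            "mate_pos": int(match.group("pos")),
            "joined": joined,  # mate piece joined after or before t
            "extends": "right" if bracket == "[" else "left",
            # Per the table, t]p] and [p[t are the reverse-complemented forms.
            "reverse_complement": (joined == "after") == (bracket == "]"),
        }
    return None


example = parse_breakend_alt("G]17:198982]")
print(example)
```
+
+An ordinary ALT allele such as "A" matches neither pattern, so the function
+returns `None` for it.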
+
+RNA-Seq variants
+----------------
+
+VCF specifications have been extended to address expressed variants obtained
+from RNA-Seq. Features added for structural variants from genome/exome
+sequencing are applicable to RNA-Seq structural variants. However, RNA-Seq
+breakends are represented by setting **SVTYPE=FND** instead of BND (Table 8)
+since they can be different from those observed in DNA-Seq.
+
+**Table 8: Fields added for RNA-Seq variants**
+
+| **Field:Sub-field** | **Description** | **Declaration in HEADER** | **Required** |
+| ------------------- | --------------- | ------------------------- | ------------ |
+| INFO:**SVTYPE** | Type of structural variant; SVTYPE is set to "FND" for breakends associated with RNA-Seq | ##INFO= | Yes |
+| | | *SVTYPE=FND* | (required for RNA-Seq breakend records; SVTYPE=FND) |
+
+VCF files for RNA-Seq variants may include gene-related annotations. However,
+this is not a standard feature of VCF files, as eventually all VCF variants will
+be annotated using information in the Generic Annotation File (GAF). Additional
+INFO and FORMAT sub-fields have been included to describe the characteristics of
+expressed nucleotide variants (Table 8a).
+
+**Table 8a: Annotation fields added for RNA-Seq variants**
+
+| **Field:Sub-field** | **Description** | **Declaration in HEADER** | **Required** |
+| ------------------- | --------------- | ------------------------- | ------------ |
+| INFO:**SID** | Unique identifiers from the gene annotation source as specified in ##geneAnno; "unknown" should be used if identifier is not known; comma-separated list of IDs can be used if variant overlaps with multiple features | ##INFO= | No |
+| | | *SID=13,198* | |
+| INFO:**GENE** | HUGO gene symbol; "unknown" should be used when gene symbol is unknown; comma-separated list of genes can be used if variant overlaps with multiple transcripts/genes | ##INFO= | No |
+| | | *GENE=ERBB2,ERBB2* | |
+| INFO:**RGN** | Region where a nucleotide variant occurs in relation to a gene | ##INFO= | No |
+| | | *RGN=exon,3_utr* | |
+| INFO:**RE** | Flag to indicate if position is known to have RNA-edits occur | ##INFO= | No |
+| | | *RE* | |
+| FORMAT:**TE** | Translational effect of a nucleotide variant in a codon | ##FORMAT= | |
+| | | *MIS,NA* | |
+
+Including validation status in VCF file
+---------------------------------------
+
+Somatic variations are often validated using follow-up experiments to confirm
+that the variant is not due to sequencing errors. The following points need to
+be considered when including validation status in a VCF file:
+
+- A single VCF file will contain sequence data for a single case. The file
+ could be the result of merging calls from different centers so validation
+ can be performed on a set of variants reported in a merged VCF file.
+
+- Validation with secondary technology is performed after obtaining results
+ from primary sequencing method. Therefore, validation is a confirmation step
+ and may or may not be performed before a first-pass VCF file with all
+ candidate mutations is generated and submitted to the DCC.
+
+- A single mutation can be verified with multiple independent methods and the
+ results may or may not be in agreement.
+
+- If results from different methods are in conflict, the final validation
+ status of the variant call needs to be inferred based on available
+ information. This could be done manually or programmatically.
+
+**Format validation**
+
+Since validation data is added as additional genotype/sample columns, the file
+will pass validation as long as all existing format rules are followed and
+header declarations are correct.
+
+**Sample TCGA VCF file with validation status**
+
+Line1 ##fileformat=VCFv4.1
+
+Line2 ##tcgaversion=1.1
+
+Line3 ##fileDate=20120205
+
+Line4 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
+
+Line5 ##FORMAT=
+
+Line6 ##FORMAT=
+
+Line7 ##FORMAT=
+
+Line8 ##INFO=
+
+Line9 ##FILTER=
+
+Line10
+##SAMPLE=
+
+Line11
+##SAMPLE=
+
+Line12
+##SAMPLE=
+
+Line13
+##SAMPLE=
+
+Line14
+##SAMPLE=
+
+Line15 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR NORMAL_454
+TUMOR_454 TUMOR_Sanger
+
+Line16 20 14370 var1 G A 29 PASS VLS=2 GT:GQ:SS 0/0:48:. 0/1:50:2 0/0:20:.
+0/1:20:2 0/1:.:2
+
+Line17 5 15000 var2 T C 35 PASS VLS=1 GT:GQ:SS 0/1:48:. 1/1:51:3 0/1:60:.
+0/1:50:1 0/1:13:1
+
+Line18 3 170089 var2 G T 30 PASS . GT:GQ:SS 0/1:48:. 0/1:51:1 .:.:. .:.:. .:.:.
+
+The format follows these guidelines:
+
+1. **Sample columns**
+
+ - An additional column is included for every line of evidence used for
+ validation. In the example above, tumor calls are verified with 454 and
+ Sanger sequencing and normal calls are validated with 454. Therefore, 3
+ genotype columns exist in addition to the NORMAL and TUMOR sequencing
+ calls obtained with the primary sequencing method.
+
+ - The validation platform name is appended to the original sample to
+ distinguish the validation results from primary sequencing.
+ _ is used in the example above.
+
+ - **Note**: \ can be obtained from DCC [Code Tables
+ Report](http://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm).
+ The ##SAMPLE meta-information line also includes a 'Platform' tag
+ where platform name is defined.
+
+ - Each new genotype column header added to the file (e.g., TUMOR_454,
+ TUMOR_Sanger) has to be defined in the header using the ##SAMPLE
+ meta-information line (e.g., Lines 13 and 14).
+
+ - As per VCF specification, the order of FORMAT sub-fields is defined by
+ the FORMAT column and all calls from primary and validation sequencing
+ should comply with this order.
+
+ - If a sub-field does not apply to any given validation call, it should be
+ assigned a missing value (".").
+
+2. **FORMAT sub-field "SS"**
+
+ - For any given tumor genotype call, sub-field SS indicates variant status
+ with respect to non-adjacent normal counterpart (0, 1, 2, 3, 4 or 5
+ based on whether the variant is wildtype, germline, somatic, LOH,
+ post-transcriptional modification, or unknown respectively). Therefore,
+ each tumor genotype call (primary and secondary sequencing) will have
+ its own corresponding SS sub-field.
+
+3. **INFO sub-field "VLS"**
+
+ - Sub-field VLS represents an inferred decision for a tumor genotype call
+ and is based on the calls obtained with validation. In the example
+ above, var1 shows a somatic call (SS=2) for the tumor sample based on
+ primary sequencing, and both validation methods confirm this call.
+ Therefore, the final validation status of var1 is a somatic variation
+ (VLS=2). However, var2 has an LOH variant in the tumor sample (SS=3) based on
+ primary sequencing whereas both validation methods indicate that it is a
+ germline variant (SS=1). In such a case, "VLS" has to be inferred from
+ available information and could differ from the SS value assigned to the
+ tumor sample based on primary sequencing.
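+
+The comparison described above can be sketched in Python using the var2 record
+from the sample file. This is an illustrative sketch only: the helper names
+`sample_calls` and `ss_by_sample` are hypothetical, the lines are assumed to be
+tab-delimited as the format requires, and the SS labels are taken from the list
+in guideline 2.
+
```python
# SS codes as listed in the description of the FORMAT sub-field "SS".
SS_CODES = {
    "0": "wildtype",
    "1": "germline",
    "2": "somatic",
    "3": "LOH",
    "4": "post-transcriptional modification",
    "5": "unknown",
}


def sample_calls(header_line, data_line):
    """Map each sample column name to a dict of its FORMAT sub-field values."""
    names = header_line.lstrip("#").split("\t")
    values = data_line.split("\t")
    format_keys = values[8].split(":")
    calls = {}
    for name, cell in zip(names[9:], values[9:]):
        sub_values = cell.split(":")
        # Every sample must supply a value (or ".") for every FORMAT sub-field.
        if len(sub_values) != len(format_keys):
            raise ValueError(f"{name}: expected {len(format_keys)} sub-fields")
        calls[name] = dict(zip(format_keys, sub_values))
    return calls


def ss_by_sample(calls):
    """Return the decoded SS status for every sample whose SS is not missing."""
    return {
        name: SS_CODES[call["SS"]]
        for name, call in calls.items()
        if call.get("SS", ".") != "."
    }


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR\tNORMAL_454\tTUMOR_454\tTUMOR_Sanger"
var2 = "5\t15000\tvar2\tT\tC\t35\tPASS\tVLS=1\tGT:GQ:SS\t0/1:48:.\t1/1:51:3\t0/1:60:.\t0/1:50:1\t0/1:13:1"
print(ss_by_sample(sample_calls(header, var2)))
```
+
+For var2, the primary tumor call decodes to LOH while both validation columns
+decode to germline, which is the conflict that VLS=1 resolves.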
+
+Validation rules
+================
+
+At a minimum, every file needs to go through the checks listed below. The
+following example of a VCF file shows certain violations cited in the listed
+validation steps. Please note that line numbers in the file segment below are
+added for illustration purposes alone and are not expected to be found in an
+actual VCF file.
+
+Line1 ##fileformat=VCFv4.1
+
+Line2 ##fileDate=20090805
+
+Line3 ##source=myImputationProgramV3.1
+
+Line4 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
+
+Line5 ##INFO=
+
+Line6 ##INFO=
+
+Line7 ##FORMAT=
+
+Line8 ##FORMAT=
+
+Line9 ##FORMAT=
+
+Line10 ##FORMAT=
+
+Line11 ##FILTER=
+
+Line12 ##FILTER=
+
+Line13 FILTER=
+
+Line14 ##ALT=
+
+Line15 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TCGA-02-0001-01
+TCGA-02-0001-02
+
+Line16 20 14370 var1 G A 29 q10 NS=2;DP=14 GT:GQ:DP 0|0:48 0|1:48:3
+
+Line17 19 15000 var2 G A 35 q10;s50 NS=2.5 GQ:GT 48:0|0 51:0|1
+
+Line18 19 16000 var3 C T 30 q10;s10 NS=2 GT:GQ:DP 0/2:48:3 0/1:51:4
+
+Line19 2 14477 rs123 C \