Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First stable version #1

Merged
merged 98 commits into from
Dec 10, 2024
Merged

First stable version #1

merged 98 commits into from
Dec 10, 2024

Conversation

g-pyxl
Copy link
Contributor

@g-pyxl g-pyxl commented Oct 22, 2024

Initial commit for review/testing.


This change is Reviewable

@rebeccahaines1 rebeccahaines1 self-assigned this Oct 24, 2024
Copy link
Contributor

@rebeccahaines1 rebeccahaines1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 22 of 55 files at r1.
Reviewable status: 22 of 55 files reviewed, 5 unresolved discussions (waiting on @g-pyxl)


README.md line 28 at r1 (raw file):

- Integration with PanelApp for importing gene lists from predefined panels
- Custom padding to gene regions
- Live edit regions of interest in results viewer

I think you said you'd removed this option


README.md line 35 at r1 (raw file):

- BED file management system with draft and published states

## Functionality schema

in this image there's a user action "modify" but I thought this had been removed?


README.md line 108 at r1 (raw file):

1. Build the Docker image:

docker build -t bed-file-generator .

specify tag (for Readme could just be generic) ??


README.md line 113 at r1 (raw file):

2. Run the container:

docker run -p 5000:5000 -e BED_GENERATOR_FLASK_KEY=<your_secret_key> bed-file-generator

again specify tag?


README.md line 152 at r1 (raw file):

- Uses the IGV.js library for genomic visualization
- Integrates with PanelApp for gene panel data
- Utilises Ensembl VEP and TARK RESTful APIs for metadata

is there any version information for these that should be noted here?

Copy link
Contributor

@rebeccahaines1 rebeccahaines1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 25 of 55 files at r1.
Reviewable status: 46 of 55 files reviewed, 13 unresolved discussions (waiting on @g-pyxl)


README.md line 151 at r1 (raw file):

- Developed by the Synnovis Bioinformatics Team @ Guy's and St Thomas' NHS Foundation Trust (2024)
- Uses the IGV.js library for genomic visualization
- Integrates with PanelApp for gene panel data

also version info for the panelapp API?


app/bed_generator/api.py line 207 at r1 (raw file):

                return [max(matching_grch37, key=lambda x: int(x['stable_id_version']))]

    # If no MANE transcripts or matching GRCh37 transcript, selects the transcript with the highest version number.

not sure about this logic (not that the code is wrong, but that this isn't necessarily the most appropriate choice). I think if the transcript hasn't been provided and there's no MANE this needs to be a decision made by a clinical scientist. As a minimum I think any transcripts selected by this logic should be flagged?


app/bed_generator/api.py line 370 at r1 (raw file):

        if len(unknown_features) > 1:
            for feature in unknown_features:
                feature['alert'] = f"Coordinate {coord} overlaps multiple unknown genes."

what would this actually mean? how can a feature overlap "multiple unknown genes"?


app/bed_generator/api.py line 413 at r1 (raw file):

                'version': panel['version'],
                'version_created': panel['version_created'],
                'genes': fetch_genes_for_panel(panel['id'], include_amber=True, include_red=True)

does the user have the option to exclude amber and red genes?


app/bed_generator/bed_generator.py line 32 at r1 (raw file):

        """
        loc_region = f"chr{r['loc_region']}" if add_chr_prefix else r['loc_region']
        loc_start = r['loc_start'] - padding

Just want to check that this is fine for reverse strand genes, i.e. that elsewhere the code is handling that start and end in that context might be flipped so needs to be handled first?


app/bed_generator/routes.py line 95 at r1 (raw file):

    print("Results:", results)
    print("assembly:", assembly)
    print("initial_query:", initial_query)  # Debug print

is this debug statement still needed?


app/bed_generator/routes.py line 120 at r1 (raw file):

    # Recalculate loc_start and loc_end based on new padding
    for result in results:
        result['loc_start'] = int(result['original_loc_start']) - padding_5

same comment as before re padding- assume reverse strand gene start and stop coords are handled so Start isn't > than stop?


app/bed_generator/routes.py line 197 at r1 (raw file):

@bed_generator_bp.route('/submit_for_review', methods=['POST'])
def submit_for_review():
    try:

This is a very long try/except statement which is a bit hard to follow, especially with all the next ifs/fors. If it's working I'm happy for it to be left for now, but could an issue be added to the repo to consider reviewing/refining this in a future update?

Copy link
Contributor Author

@g-pyxl g-pyxl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 37 of 56 files reviewed, 13 unresolved discussions (waiting on @rebeccahaines1)


README.md line 28 at r1 (raw file):

Previously, rebeccahaines1 wrote…

I think you said you'd removed this option

README updated to reflect this.


README.md line 35 at r1 (raw file):

Previously, rebeccahaines1 wrote…

in this image there's a user action "modify" but I thought this had been removed?

Have updated the schema to remove this.


README.md line 108 at r1 (raw file):

Previously, rebeccahaines1 wrote…

specify tag (for Readme could just be generic) ??

README updated for this.


README.md line 113 at r1 (raw file):

Previously, rebeccahaines1 wrote…

again specify tag?

README updated.


README.md line 151 at r1 (raw file):

Previously, rebeccahaines1 wrote…

also version info for the panelapp API?

README updated to include versions.


README.md line 152 at r1 (raw file):

Previously, rebeccahaines1 wrote…

is there any version information for these that should be noted here?

Have also added versions used for these tools.


app/bed_generator/api.py line 207 at r1 (raw file):

Previously, rebeccahaines1 wrote…

not sure about this logic (not that the code is wrong, but that this isn't necessarily the most appropriate choice). I think if the transcript hasn't been provided and there's no MANE this needs to be a decision made by a clinical scientist. As a minimum I think any transcripts selected by this logic should be flagged?

I have added a new field within the results that informs the user of logical decisions made in selecting transcripts. I have also added a check within the bed review process that informs the user that a clinical review must be made for any non-default (MANE) transcript selections within the BED file. User now has to review and tick a box accepting that this has been reviewed before submission will be accepted.


app/bed_generator/api.py line 370 at r1 (raw file):

Previously, rebeccahaines1 wrote…

what would this actually mean? how can a feature overlap "multiple unknown genes"?

The only scenario that this will pick up is where a significantly large genomic region in coordinates is used as a query. This is very much an edge case, but the VEP overlap endpoint may return multiple "unknown" features if the region queried is large enough. I think the term "uncharacterised genomic regions" as opposed to genes is probably more appropriate and have updated to reflect this.


app/bed_generator/api.py line 413 at r1 (raw file):

Previously, rebeccahaines1 wrote…

does the user have the option to exclude amber and red genes?

These are excluded from the frontend by default. The approach here is to retrieve the data in the backend to avoid performing secondary API calls if the user chooses to include amber/red genes.


app/bed_generator/bed_generator.py line 32 at r1 (raw file):

Previously, rebeccahaines1 wrote…

Just want to check that this is fine for reverse strand genes, i.e. that elsewhere the code is handling that start and end in that context might be flipped so needs to be handled first?

Have included screenshots within another reply - padding is applied based on strand.


app/bed_generator/routes.py line 95 at r1 (raw file):

Previously, rebeccahaines1 wrote…

is this debug statement still needed?

No longer needed and removed in latest commit.


app/bed_generator/routes.py line 120 at r1 (raw file):

Previously, rebeccahaines1 wrote…

same comment as before re padding- assume reverse strand gene start and stop coords are handled so Start isn't > than stop?

Have adjusted the logic to simplify this and handle in routes. Forward and reverse are handled as you've described and have performed some double checks with a mixed range. Attaching image:

  • Query using APOE (forward) and BRCA1 (reverse)
  • 50bp padding has been applied to 5'.
  • Padding applied discretely based on identifier strand direction within the same session
    strandtesting.png

app/bed_generator/routes.py line 197 at r1 (raw file):

Previously, rebeccahaines1 wrote…

This is a very long try/except statement which is a bit hard to follow, especially with all the next ifs/fors. If it's working I'm happy for it to be left for now, but could an issue be added to the repo to consider reviewing/refining this in a future update?

Agreed. I've refactored this down into smaller, (hopefully) easier to follow functions in relevant modules.

Copy link
Contributor

@rebeccahaines1 rebeccahaines1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 55 files at r1, 14 of 14 files at r3, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @g-pyxl)


app/bed_generator/panels.json line 0 at r3 (raw file):
this is a dump of panelapp data. please can it be removed before this branch is merged in to main


app/bed_generator/utils.py line 234 at r3 (raw file):

        panel_id: The ID of the panel.
    """
    genes = fetch_genes_for_panel(panel_id, include_amber=True, include_red=True)

is it clear to the user from the input page that amber and red genes will be includes and/or could/should there be an option to exclude them? Option to exclude can be a future development.


app/bed_manager/routes.py line 33 at r3 (raw file):

@login_required
def submit_bed_file():
    # Logic for submitting a new BED file

? is something missing here?


app/bed_manager/routes.py line 39 at r3 (raw file):

@login_required
def review_bed_file(file_id):
    # Logic for reviewing a BED file

see line 33


app/static/js/results.js line 12 at r3 (raw file):

        settings = data;
        // Set the default assembly in the dropdown
        document.getElementById('genome-select').value = settings.defaultAssembly === 'GRCh38' ? 'hg38' : 'hg19';

? GRCh37 ?

Copy link
Contributor Author

@g-pyxl g-pyxl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last commit will ideally remove as much redundancy as possible, and will update the README to include some end-to-end testing from this latest version (including using generated BEDs with some of our pipeline components to compare outputs).

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @rebeccahaines1)


app/bed_generator/panels.json line at r3 (raw file):

Previously, rebeccahaines1 wrote…

this is a dump of panelapp data. please can it be removed before this branch is merged in to main

Removed.


app/bed_generator/utils.py line 234 at r3 (raw file):

Previously, rebeccahaines1 wrote…

is it clear to the user from the input page that amber and red genes will be includes and/or could/should there be an option to exclude them? Option to exclude can be a future development.

They are excluded by default, but stored in the session query to prevent the need for additional API calls if they user chooses to include these. A default selection would look like this:

image.png

And when an option is ticked:

image copy 1.png

Unticking will remove these genes again.


app/bed_manager/routes.py line 33 at r3 (raw file):

Previously, rebeccahaines1 wrote…

? is something missing here?

Old logic, this was changed and function no longer needed. Removed.


app/bed_manager/routes.py line 39 at r3 (raw file):

Previously, rebeccahaines1 wrote…

see line 33

Old logic, this was changed and function no longer needed. Removed.


app/static/js/results.js line 12 at r3 (raw file):

Previously, rebeccahaines1 wrote…

? GRCh37 ?

This is the genome selection for IGV. The options available to IGV are 'hg38' and 'hg19'.

The code here is storing the initially query from the frontend as settings. If the assembly selected and passed from the initial query == "GRCh38", set IGV to 'hg38'. If anything else, set IGV to 'hg19' by default.

Copy link
Contributor

@rebeccahaines1 rebeccahaines1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @g-pyxl)


app/bed_manager/routes.py line 33 at r3 (raw file):

Previously, g-pyxl (George Doyle) wrote…

Old logic, this was changed and function no longer needed. Removed.

ok. please can you delete from the coed to avoid future confusion


app/bed_manager/routes.py line 39 at r3 (raw file):

Previously, g-pyxl (George Doyle) wrote…

Old logic, this was changed and function no longer needed. Removed.

see line 33 above

@g-pyxl
Copy link
Contributor Author

g-pyxl commented Dec 5, 2024

Merged new changes from MVP branch.

Copy link
Contributor

@rebeccahaines1 rebeccahaines1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 7 files at r5, 3 of 8 files at r6, 39 of 39 files at r7, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @g-pyxl)


app/bed_generator/transcript_processing.py line 2 at r7 (raw file):

"""
transcript_processing.py - Functions for processing transcript data.

I think all of these functions are duplicated from app/bed_generator/api.py


app/bed_generator/panels.json line 0 at r7 (raw file):
I think this file needs deleting


app/bed_generator/routes.py line 151 at r6 (raw file):

        # Check both original coordinates and rsID presence
        is_variant = (
            int(result['original_loc_start']) == int(result['original_loc_end']) or

is this assuming a variant is always a single base change? what about indels?


app/bed_generator/api.py line 4 at r7 (raw file):

api.py - Fetches and processes variant information from Ensembl, TARK, and PanelApp APIs.

Functions:

this list is incomplete- please can it be reviewed


app/bed_generator/genes.json line 0 at r7 (raw file):
Think this file needs to be deleted?

Copy link
Contributor Author

@g-pyxl g-pyxl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main update accounts for padding to variants over 1bp. Various files removed, docstrings updated.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @rebeccahaines1)


app/bed_generator/api.py line 4 at r7 (raw file):

Previously, rebeccahaines1 wrote…

this list is incomplete- please can it be reviewed

Updated all module docstrings


app/bed_generator/genes.json line at r7 (raw file):

Previously, rebeccahaines1 wrote…

Think this file needs to be deleted?

Deleted


app/bed_generator/panels.json line at r7 (raw file):

Previously, rebeccahaines1 wrote…

I think this file needs deleting

Deleted


app/bed_generator/routes.py line 151 at r6 (raw file):

Previously, rebeccahaines1 wrote…

is this assuming a variant is always a single base change? what about indels?

I have applied the following handler for this in routes.py to account for longer variants. Tested using various indel identifiers, downloads and db entries both applying correct padding for these cases.    

Handle SNPs

    if entry.get('is_snp', False) or entry.get('rsid'):

        if snp_padding:

            # For multi-bp SNPs, apply padding to start and end directly

            if entry['loc_start'] != entry['loc_end']:

                result['loc_start'] = int(entry['loc_start']) - snp_padding

                result['loc_end'] = int(entry['loc_end']) + snp_padding

            else:

                # For single-bp SNPs, calculate from center point

                center = int(entry['loc_start'])

                result['loc_start'] = center - snp_padding

                result['loc_end'] = center + snp_padding

        return result


app/bed_generator/transcript_processing.py line 2 at r7 (raw file):

Previously, rebeccahaines1 wrote…

I think all of these functions are duplicated from app/bed_generator/api.py

Removed - was part of a refactoring branch and accidentally included

Copy link
Contributor

@rebeccahaines1 rebeccahaines1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 12 files at r8, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @g-pyxl)

@rebeccahaines1 rebeccahaines1 merged commit 7a41b56 into main Dec 10, 2024
@rebeccahaines1 rebeccahaines1 deleted the dev branch December 10, 2024 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants