Fix missing vim genes #182

Merged
merged 13 commits into from
Sep 20, 2024

samuell
Contributor

@samuell samuell commented Sep 13, 2024

Description

Summary of the changes made:

This PR replaces the Spades assembler with Skesa, as it has been shown to produce generally higher quality assemblies and seems to work well with NovaSeq X data, thereby fixing multiple issues we have had with the microSALT pipeline.

The contig names in the Skesa fasta output are converted to the format used by Spades immediately after assembly, so that the scraping logic in microSALT does not need to change.
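The header conversion described above could look roughly like the sketch below. This is not the code from this PR. It assumes that Skesa names contigs `Contig_<index>_<coverage>` (optionally with a suffix such as `_Circ`) and that the Spades-style name is `NODE_<index>_length_<length>_cov_<coverage>`; the function name `skesa_to_spades_headers` is made up for illustration.

```python
import re

# Assumed Skesa header layout: >Contig_<index>_<coverage>[_<suffix>]
SKESA_HEADER = re.compile(r"^>Contig_(\d+)_([0-9.]+)")


def skesa_to_spades_headers(fasta_text: str) -> str:
    """Rewrite Skesa-style contig headers as Spades-style NODE headers.

    The contig length is not part of the Skesa header, so it is computed
    from the sequence lines that follow each header.
    """
    records = []  # each entry: (index, coverage, list of sequence lines)
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = SKESA_HEADER.match(line)
            if match is None:
                raise ValueError(f"Unexpected contig header: {line}")
            records.append((match.group(1), match.group(2), []))
        elif line.strip():
            if not records:
                raise ValueError("Sequence data found before the first header")
            records[-1][2].append(line.strip())

    out = []
    for index, coverage, seq_lines in records:
        length = sum(len(seq) for seq in seq_lines)
        out.append(f">NODE_{index}_length_{length}_cov_{coverage}")
        out.extend(seq_lines)
    return "\n".join(out) + "\n"
```

Rewriting the headers right after assembly keeps the change local to the assembly step, which is why downstream parsing can stay untouched.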

Initial tests seem to show that assembly is also faster with Skesa.

This fixes issue #180

Primary function of PR

  • Hotfix
  • Patch
  • Minor functionality improvement
  • New type of analysis
  • Backward-breaking functionality improvement
  • This change requires internal documents to be updated
  • This change requires another repository to be updated

Testing

Test routine to verify the stability of the PR:

  • bash /home/proj/production/servers/resources/hasta.scilifelab.se/update-microsalt-stage.sh BRANCHNAME
  • us
  • source activate S_microSALT
  • (SITUATIONAL) export MICROSALT_CONFIG=/home/proj/dropbox/microSALT.json
  • Select a relevant subset of the following:
  • microSALT analyse project MIC3109
  • microSALT analyse project MIC4107
  • microSALT analyse project MIC4109
  • microSALT analyse project ACC5551

Verify that the results for projects MIC3109, MIC4107, MIC4109 & ACC5551 are consistent with the results attached to AMSystem doc 1490, Microbial_WGS.xlsx

Test results

These are the results of the tests, and necessary conclusions, that prove the stability of the PR.

Sign-offs

@samuell samuell force-pushed the 180-fix-missing-vim-genes branch from 33d00c2 to 635de3c Compare September 16, 2024 14:05
@samuell samuell marked this pull request as ready for review September 16, 2024 14:33
@samuell samuell requested a review from a team as a code owner September 16, 2024 14:33
Contributor

@karlnyr karlnyr left a comment

This looks great! One small comment: we have to update the implementation plan, since we also need to update the assembler information in mlstfest:

https://github.com/search?q=repo%3AClinical-Genomics%2Fmlstfest%20spades&type=code

We send in assemblies, and we just have to change it so that we say it is SKESA and not spades

@samuell samuell merged commit edb9256 into rc400 Sep 20, 2024
1 check passed
@samuell samuell linked an issue Oct 10, 2024 that may be closed by this pull request
Contributor Author

samuell commented Oct 11, 2024

To show that the VIM gene shows up after the fix, here are the resistances for the sample named above, for the factualdodo case:

In microSALT 3.3.5, no VIM gene is found in this sample:

image

In microSALT 4.0.0, though, we can see that blaVIM-4 shows up:

image

Contributor Author

samuell commented Oct 12, 2024

@karlnyr noted that there were some differences in the MLST calling for a couple of validation samples that he had run with both Spades (with the --careful flag) and Skesa. Below follows a walkthrough and analysis of those differences.

The diffs were created by running the htmldiff tool on the microSALT typing reports of the Spades/--careful and Skesa runs, respectively. Where the reports differ, the Spades/--careful result is shown with strikethrough and the Skesa result is underlined.

I have gone through the MLST sections in all the reports, added screenshots of those that include differences, and added a comment to each.

The samples are:

  • ACC5551
  • ACC155791
  • MIC3557
  • MIC4109

Find them below, but first a summary of the findings.

Summary

In all the findings below except the A62 sample of MIC3557, the previous allele call had either a length or an identity below 100%, and can thus be argued to be based on subpar data quality.

The A62 sample of MIC3557 is the most problematic one, as arcC is completely missing in the Skesa run. In a follow-up analysis using untrimmed data, the arcC gene does show up, though with a really low identity. This indicates that this sample likely also has a data quality problem, which is further supported by the fact that another MLST gene in the same sample has a length of only around 50%.
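The reasoning used in the comments below can be written down as a small rule. The sketch that follows is only an illustration of that rule, not part of microSALT: the `AlleleCall` type, the `classify_difference` function and the category strings are all hypothetical, and the values in the example call are invented placeholders rather than numbers from the reports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlleleCall:
    allele: str      # allele number as shown in the typing report
    length: float    # percent of the reference allele covered
    identity: float  # percent identity to the reference allele


def classify_difference(old: Optional[AlleleCall], new: Optional[AlleleCall]) -> str:
    """Classify one MLST gene when comparing an old run against a new run.

    `None` means the gene was not reported at all in that run.
    """
    if new is None:
        return "missing in new run"
    if old is None:
        return "missing in old run"
    if old.allele == new.allele:
        return "same"
    # A changed call is treated as explainable when the old call was not a
    # perfect match, mirroring the argument made in the comments below.
    if old.length < 100.0 or old.identity < 100.0:
        return "old call imperfect"
    return "needs review"


# Invented placeholder values, for illustration only.
result = classify_difference(
    AlleleCall(allele="1", length=98.0, identity=100.0),
    AlleleCall(allele="2", length=100.0, identity=100.0),
)
```

With a rule like this, only genes classified as "missing in new run" or "needs review" would call for the kind of manual follow-up done for the A62 sample.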


ACC5551

Case ...A9

image

Comment: Note that all the differing alleles had a non-100% length in the previous (Spades/--careful) run, and thus the differences can be explained by subpar quality in the Spades/--careful assembly.


ACC155791

Case ...A5

image

Comment: Note that the differing allele had a non-100% length and identity in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A6

image

Comment: Note that the differing allele had a non-100% length and identity in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A8

image

Comment: Note that the differing allele had a non-100% length in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A9

image

Comment: Note that the differing allele had a non-100% length and identity in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A10

image

Comment: Note that the differing allele had a non-100% length and identity in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A12

image

Comment: Note that the differing allele had a non-100% length in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A13

image

Comment: Note that the differing allele had a non-100% length in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A13

image

Comment: Note that the differing allele had a non-100% length in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A19

image

Comment: Note that the differing allele had a non-100% length and identity in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.


MIC3557

Case ...A61

image

Comment: Note that all the differing alleles had at least non-100% length, and sometimes non-100% identity too, in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A62

image

Comment: Note that most of the differing alleles had at least non-100% length, and sometimes non-100% identity too, in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

The one exception is a missing arcC gene in the Skesa version. Based on the Skesa publication, this is probably due to some underlying quality problems in the data that Skesa finds dubious enough not to assemble.

We can also note that the tpi gene had a very low length both before and after: under 60%, which is way under the 90% threshold. The whole sample is therefore failed, indicating that there are probably some problems with the underlying data.

Update: I tried running Skesa on the untrimmed data (with the --untrimmed flag to microSALT), and then the arcC gene is found, but with a really low length of 59.31%:

image

This indicates that there is probably some problem with the underlying data in this sample (perhaps too few reads?).
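The length check referred to above can be sketched as follows. This assumes the rule is that a sample fails as soon as any MLST gene falls under the 90% length threshold; the function names are hypothetical and this is not the microSALT implementation. The only value taken from this thread is the 59.31% length for arcC in the untrimmed run.

```python
from typing import Dict, List

LENGTH_THRESHOLD = 90.0  # percent, the threshold mentioned in the comment above


def failing_genes(lengths: Dict[str, float],
                  threshold: float = LENGTH_THRESHOLD) -> List[str]:
    """Return the genes whose length (in percent) is below the threshold."""
    return sorted(gene for gene, length in lengths.items() if length < threshold)


def sample_passes(lengths: Dict[str, float],
                  threshold: float = LENGTH_THRESHOLD) -> bool:
    """Assumed rule: a sample passes only if no gene is below the threshold."""
    return not failing_genes(lengths, threshold)


# arcC length from the untrimmed Skesa run reported above.
untrimmed_run = {"arcC": 59.31}
below_threshold = failing_genes(untrimmed_run)
```

Under this assumed rule the sample fails on arcC alone in the untrimmed run, independently of the tpi result.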


MIC4109

Comment: No MLST difference found here!

Successfully merging this pull request may close these issues.

Issue with reporting VIM resistance genes
2 participants