Fix missing vim genes #182

samuell · 2024-09-13T10:47:28Z

Description

Summary of the changes made:

This PR replaces the Spades assembler with Skesa, as it has been shown to produce generally higher quality assemblies and seems to work well with NovaSeq X data, thereby fixing multiple issues we have had with the microSALT pipeline.

The naming format used in the contigs FASTA file is converted into the format used by Spades immediately after assembly, so that the scraping logic in microSALT does not need to change.
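As a rough illustration of that renaming step, here is a minimal sketch. It is not the code from this PR. It assumes that Skesa names contigs `Contig_<n>_<coverage>` (optionally suffixed with `_Circ`) and that the Spades-style name is `NODE_<n>_length_<length>_cov_<coverage>`; the function name `skesa_to_spades_headers` is hypothetical.

```python
import re

# Assumed Skesa header layout: ">Contig_<n>_<coverage>", optionally ending in "_Circ".
SKESA_HEADER = re.compile(r"^>Contig_(\d+)_([0-9.]+)(?:_Circ)?\s*$")


def skesa_to_spades_headers(fasta_text: str) -> str:
    """Rewrite Skesa-style FASTA headers into (assumed) Spades-style headers."""
    records = []  # one (index, coverage, sequence_lines) tuple per contig
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = SKESA_HEADER.match(line)
            if match is None:
                raise ValueError(f"Unexpected header: {line!r}")
            records.append((match.group(1), match.group(2), []))
        elif line.strip():
            if not records:
                raise ValueError("Sequence data found before the first header")
            records[-1][2].append(line.strip())

    out = []
    for index, coverage, seq_lines in records:
        # Spades-style names embed the contig length, so compute it here.
        length = sum(len(seq) for seq in seq_lines)
        out.append(f">NODE_{index}_length_{length}_cov_{coverage}")
        out.extend(seq_lines)
    return "\n".join(out) + "\n"
```

Doing the conversion right after assembly keeps every downstream consumer of the contigs file unaware of which assembler produced it.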

Initial tests seem to show that assembly is also faster with Skesa.

This fixes issue #180

Primary function of PR

Hotfix
Patch
Minor functionality improvement
New type of analysis
Backward-breaking functionality improvement
This change requires internal documents to be updated
This change requires another repository to be updated

Testing

Test routine to verify the stability of the PR:

bash /home/proj/production/servers/resources/hasta.scilifelab.se/update-microsalt-stage.sh BRANCHNAME
us
source activate S_microSALT
(SITUATIONAL) export MICROSALT_CONFIG=/home/proj/dropbox/microSALT.json
Select a relevant subset of the following:
microSALT analyse project MIC3109
microSALT analyse project MIC4107
microSALT analyse project MIC4109
microSALT analyse project ACC5551

Verify that the results for projects MIC3109, MIC4107, MIC4109 & ACC5551 are consistent with the results attached to AMSystem doc 1490, Microbial_WGS.xlsx

Test results

These are the results of the tests, and necessary conclusions, that prove the stability of the PR.

Sign-offs

Code tested by @octocat
Approved to run at Clinical-Genomics by @karlnyr

karlnyr

This looks great! One small comment: we have to update the implementation plan, since we need to update the information in mlstfest:

https://github.com/search?q=repo%3AClinical-Genomics%2Fmlstfest%20spades&type=code

We send in assemblies, so we just have to change it to say SKESA instead of Spades.

samuell · 2024-10-11T09:49:27Z

To show that the VIM gene shows up after the fix, here are the resistances for the sample named above, for the factualdodo case:

In MicroSALT 3.3.5, we cannot find any VIM gene in this sample:

In MicroSALT 4.0.0, though, we can see that blaVIM-4 shows up:

samuell · 2024-10-12T11:55:48Z

As @karlnyr noted that there were some differences in the MLST calling for a couple of validation samples that he had run with both Spades (with the --careful flag) and Skesa, a walkthrough and analysis of the differences follows below.

The differences are created using the htmldiff tool run on the MicroSALT Typing reports of the Spades/--careful and Skesa runs respectively. When there are differences, the Spades/--careful results will have a strikethrough effect, and the Skesa version will be underlined.

I have then gone through all the MLST sections in all the reports, and added screenshots of those that include differences, and added a comment to each:

The samples are:

ACC5551
ACC155791
MIC3557
MIC4109

Find them below, but first a summary of the findings.

Summary

In summary, in all the findings below, except the A62 sample of MIC3557, the previous allele call had either a length or identity well below 100%, and can thus be argued to be based on subpar quality of data.

The A62 sample of MIC3557 is the most problematic one, as arcC is completely missing in the Skesa run. A follow-up analysis using untrimmed data makes the arcC gene show up, though with a really low identity, indicating that this sample also likely has a data quality problem. This is further supported by the fact that another MLST gene in the same sample has only around 50% length.
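The rule applied in the per-case comments below can be sketched roughly as follows. This is only an illustration of the reasoning, not code from microSALT: the `AlleleCall` type, the `review_differences` function and the labels are hypothetical, and the values in the usage example are made up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AlleleCall:
    """One MLST allele call as shown in a typing report (hypothetical type)."""
    allele: str
    length_pct: float
    identity_pct: float


def review_differences(
    previous: Dict[str, AlleleCall], current: Dict[str, AlleleCall]
) -> Dict[str, str]:
    """Label every locus whose call changed between the two runs."""
    labels = {}
    for locus, old in previous.items():
        new = current.get(locus)
        if new is not None and new.allele == old.allele:
            continue  # same allele in both runs, nothing to review
        if old.length_pct < 100.0 or old.identity_pct < 100.0:
            # The earlier call was already imperfect, so the change is
            # plausibly explained by the quality of the earlier assembly.
            labels[locus] = "previous call imperfect"
        else:
            # A perfect earlier call that changed or vanished needs a closer look.
            labels[locus] = "needs follow-up"
    return labels
```

For example, with made-up values, `review_differences({"locusA": AlleleCall("5", 98.0, 100.0)}, {"locusA": AlleleCall("7", 100.0, 100.0)})` labels `locusA` as `"previous call imperfect"`.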

ACC5551

Case ...A9

Comment: Note that all the differing alleles had a non-100% length in the previous (Spades/--careful) run, and thus the differences can be explained by subpar quality in the Spades/--careful assembly.

ACC155791

Case ...A5

Comment: Note that the differing allele had a non-100% length and identity in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A6

Comment: Note that the differing allele had a non-100% length and identity in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A8

Comment: Note that the differing allele had a non-100% length in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A9

Comment: Note that the differing allele had a non-100% length and identity in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A10

Comment: Note that the differing allele had a non-100% length and identity in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A12

Comment: Note that the differing allele had a non-100% length in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A13

Comment: Note that the differing allele had a non-100% length in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A13

Comment: Note that the differing allele had a non-100% length in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A19

Comment: Note that the differing allele had a non-100% length and identity in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

MIC3557

Case ...A61

Comment: Note that all the differing alleles had at least non-100% length, and sometimes non-100% identity too, in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A62

Comment: Note that most of the differing alleles had at least non-100% length, and sometimes non-100% identity too, in the previous (Spades/--careful) run, and thus the difference can be explained by subpar quality in the Spades/--careful assembly.

The one exception is a missing arcC gene in the Skesa version. Based on the Skesa publication, this is probably due to underlying quality problems in the data, which Skesa finds dubious enough not to assemble.

We can also note that the tpi gene had very low length both before and after: under 60%, which is far below the 90% threshold. The whole sample therefore fails, indicating that there are probably problems with the underlying data.

Update: I tried running Skesa on the untrimmed data (with the --untrimmed flag to microSALT), and then the arcC gene is found, but with a really low length of 59.31%:

This indicates that there is probably some problem with the underlying data in this sample (perhaps too few reads)?
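The length check referred to above can be sketched like this, assuming that a sample fails as soon as any MLST locus falls below the 90% length threshold. How microSALT actually applies the threshold is not shown in this thread; `MIN_LENGTH_PCT` and `failing_loci` are hypothetical names.

```python
from typing import Dict, List

# The 90% length threshold mentioned in the comment above.
MIN_LENGTH_PCT = 90.0


def failing_loci(length_pct_by_locus: Dict[str, float]) -> List[str]:
    """Return the loci whose length percentage is below the threshold, sorted by name."""
    return sorted(
        locus
        for locus, length_pct in length_pct_by_locus.items()
        if length_pct < MIN_LENGTH_PCT
    )


def sample_passes(length_pct_by_locus: Dict[str, float]) -> bool:
    """Assumed rule: the sample passes only if no locus is below the threshold."""
    return not failing_loci(length_pct_by_locus)
```

With the arcC length of 59.31% reported above, `failing_loci({"arcC": 59.31})` returns `["arcC"]`, so the sample would not pass under this assumed rule.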

MIC4109

Comment: No MLST difference found here!

samuell added 12 commits August 22, 2024 16:12

Test for existance of some specific VIM genes

70a4c5d

Fix code that would delete blast matches with identical coverage

676044c

Replace Spades assembler with SKESA

a0b1f30

Replace blaVIM test data with output from SKESA

0bfbf16

Simplify creation of contigs file

3925d75

Convert sequence naming in Skesa contigs fasta file into Spades format

b2d4ace

Fix f-string backslash error

526e583

Fix it even more

be8c73a

Use original file name for main contig file

7b60a26

This is because this is the one that is used by the resistance testing later

Fix bug in skesa output file path

4b3092e

Update testdata to use Skesa output with Spades format

1059d5a

Check if reference_length is set before showing

635de3c

This is so that the report rendering is not completely stopped if reference_length is not set

samuell force-pushed the 180-fix-missing-vim-genes branch from 33d00c2 to 635de3c Compare September 16, 2024 14:05

samuell marked this pull request as ready for review September 16, 2024 14:33

samuell requested a review from a team as a code owner September 16, 2024 14:33

karlnyr approved these changes Sep 18, 2024

Remove internal file paths

4c0a3a2

samuell merged commit edb9256 into rc400 Sep 20, 2024
1 check passed

samuell linked an issue Oct 10, 2024 that may be closed by this pull request

Issue with reporting VIM resistance genes #180

Open

samuell mentioned this pull request Nov 5, 2024

Issue with reporting VIM resistance genes #180

Open
