Fix missing vim genes #182
This is because this is the one that is used by the resistance testing later
This keeps report rendering from being stopped completely when reference_length is not set
This looks great! One small comment: we have to update the implementation plan, since we also need to update information in mlstfest:
https://github.com/search?q=repo%3AClinical-Genomics%2Fmlstfest%20spades&type=code
We send in assemblies, so we just have to change it to say that the assembler is SKESA and not spades.
@karlnyr noted some differences in the MLST calling for a couple of validation samples that he had run with both Spades (with the --careful flag) and Skesa. Below follows a walkthrough and analysis of those differences.

The differences were produced by running the htmldiff tool on the microSALT Typing reports of the Spades/--careful and Skesa runs respectively. Where the reports differ, the Spades/--careful result is shown with strikethrough and the Skesa result is underlined. I have gone through the MLST sections of all the reports, added screenshots of those that include differences, and added a comment to each.

The samples are covered project by project below, but first a summary of the findings.

Summary

In all the findings below except the A62 sample of MIC3557, the previous allele call had either a length or an identity well below 100%, and can thus be argued to be based on subpar data quality. The A62 sample of MIC3557 is the most problematic one, as arcC is completely missing in the Skesa run. A follow-up analysis using untrimmed data makes the arcC gene show up, although with a really low identity. This indicates that this sample also likely has a data quality problem, which is further supported by the fact that another MLST gene in the same sample has only around 50% length.

ACC5551

Case ...A9
Comment: All the differing alleles had a non-100% length in the previous (Spades/--careful) run, so the differences can be explained by subpar quality in the Spades/--careful assembly.

ACC155791

Case ...A5
Comment: The differing allele had a non-100% length and identity in the previous (Spades/--careful) run, so the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A6
Comment: The differing allele had a non-100% length and identity in the previous (Spades/--careful) run, so the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A8
Comment: The differing allele had a non-100% length in the previous (Spades/--careful) run, so the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A9
Comment: The differing allele had a non-100% length and identity in the previous (Spades/--careful) run, so the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A10
Comment: The differing allele had a non-100% length and identity in the previous (Spades/--careful) run, so the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A12
Comment: The differing allele had a non-100% length in the previous (Spades/--careful) run, so the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A13
Comment: The differing allele had a non-100% length in the previous (Spades/--careful) run, so the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A13
Comment: The differing allele had a non-100% length in the previous (Spades/--careful) run, so the difference can be explained by subpar quality in the Spades/--careful assembly.

Case ...A19
Comment: The differing allele had a non-100% length and identity in the previous (Spades/--careful) run, so the difference can be explained by subpar quality in the Spades/--careful assembly.

MIC3557

Case ...A61
Comment: All the differing alleles had at least a non-100% length, and sometimes a non-100% identity too, in the previous (Spades/--careful) run, so the differences can be explained by subpar quality in the Spades/--careful assembly.

Case ...A62
Comment: Most of the differing alleles had at least a non-100% length, and sometimes a non-100% identity too, in the previous (Spades/--careful) run, so those differences can be explained by subpar quality in the Spades/--careful assembly. The one exception is the arcC gene, which is missing in the Skesa run. Based on the Skesa publication, this is probably due to underlying quality problems in the data that Skesa finds dubious enough not to assemble. Note also that the tpi gene had a very low length both before and after: under 60%, which is far below the 90% threshold. The whole sample therefore fails, indicating that there are probably problems with the underlying data.

Update: I tried running Skesa on the untrimmed data (with the ...). This indicates that there is probably some problem with the underlying data in this sample (perhaps too few reads?).

MIC4109

Comment: No MLST difference found here!
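The reasoning used in the comments above (a call is suspect when its length or identity is below 100%, and a sample fails when the length drops below the 90% threshold) can be sketched in a few lines of Python. This is only an illustration: the `AlleleCall` class, the `classify` function and the example numbers are hypothetical, not microSALT code or values from the reports, and applying the 90% threshold only to length is an assumption based on the comment on the tpi gene.

```python
from dataclasses import dataclass


@dataclass
class AlleleCall:
    """One MLST allele hit (hypothetical structure, not microSALT's model)."""
    gene: str
    identity: float  # percent identity against the reference allele
    length: float    # percent of the reference allele length that is covered


def classify(call: AlleleCall, length_threshold: float = 90.0) -> str:
    """Label a call as 'fail', 'imperfect' or 'perfect'.

    Assumption: only the length is checked against the 90% threshold,
    since that is the only threshold mentioned in the discussion above.
    """
    if call.length < length_threshold:
        return "fail"
    if call.identity < 100.0 or call.length < 100.0:
        return "imperfect"
    return "perfect"


if __name__ == "__main__":
    # Illustrative values only, not taken from the validation reports.
    calls = [
        AlleleCall("tpi", identity=100.0, length=58.0),
        AlleleCall("arcC", identity=99.2, length=100.0),
        AlleleCall("aroE", identity=100.0, length=100.0),
    ]
    for call in calls:
        print(call.gene, classify(call))
```

With a helper like this, any allele that changes between two runs and was labelled "imperfect" or "fail" in the earlier run would fall into the "explained by subpar quality" group described above.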
Description
Summary of the changes made:
This PR replaces the Spades assembler with Skesa, as it has been shown to produce generally higher quality assemblies and seems to work well with NovaSeq X data, thereby fixing multiple issues we have had with the microSALT pipeline.
Immediately after assembly, the contig names in the Skesa FASTA output are converted to the naming format used by Spades, so that the scraping logic in microSALT does not need to change.
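As an illustration of the header conversion described above, here is a minimal Python sketch. It is not the code from this PR. It assumes Skesa headers of the form `>Contig_<n>_<coverage>` (optionally ending in `_Circ`) and Spades headers of the form `>NODE_<n>_length_<len>_cov_<coverage>`; the function names are hypothetical.

```python
import re

# Assumed Skesa header shape: ">Contig_<n>_<coverage>", optional "_Circ" suffix.
SKESA_HEADER = re.compile(r"^>Contig_(\d+)_([0-9.]+)(?:_Circ)?\s*$")


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def to_spades_header(header, sequence):
    """Rewrite one Skesa-style header in the assumed Spades style.

    Headers that do not match the assumed Skesa pattern are returned unchanged.
    """
    match = SKESA_HEADER.match(header)
    if match is None:
        return header
    index, coverage = match.groups()
    return f">NODE_{index}_length_{len(sequence)}_cov_{coverage}"


def convert(lines):
    """Return FASTA lines with Spades-style headers, one sequence line per contig."""
    out = []
    for header, sequence in parse_fasta(lines):
        out.append(to_spades_header(header, sequence))
        out.append(sequence)
    return out


if __name__ == "__main__":
    example = [">Contig_1_12.5", "ACGT", "AC", ">Contig_2_3.0_Circ", "GG"]
    print("\n".join(convert(example)))
```

The contig length has to be computed from the sequence because the assumed Skesa header carries only the index and coverage, while the Spades-style name also embeds the length.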
Initial tests seem to show that assembly is also faster with Skesa.
This fixes issue #180
Primary function of PR
Testing
Test routine to verify the stability of the PR:
bash /home/proj/production/servers/resources/hasta.scilifelab.se/update-microsalt-stage.sh BRANCHNAME
us
source activate S_microSALT
export MICROSALT_CONFIG=/home/proj/dropbox/microSALT.json
microSALT analyse project MIC3109
microSALT analyse project MIC4107
microSALT analyse project MIC4109
microSALT analyse project ACC5551
Verify that the results for projects MIC3109, MIC4107, MIC4109 & ACC5551 are consistent with the results attached to AMSystem doc 1490, Microbial_WGS.xlsx
Test results
These are the results of the tests, and necessary conclusions, that prove the stability of the PR.
Sign-offs