Welcome to the project! #1
Many thanks:) |
What is our input? A set of contigs (also, what type of assembly tool & sequencing data is used?) and PB reads? Are the reads CCS? Overall, the project seems like it will probably be straightforward if we use existing methods:
Do we have test datasets with a ground truth (known reference genome)? Or should we generate simulated datasets first? If we are generating test datasets we could use PBSIM. It is less clear what we can use for simulating a set of contigs (biases in gaps depend on the type of technology used). |
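To make the simulated-dataset idea concrete, here is a toy sketch of sampling noisy reads from a known reference. It is not PBSIM and does not model real PacBio error profiles; the function name `simulate_reads`, its parameters, and the uniform substitution/insertion/deletion error model are all assumptions made purely for illustration.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_rate=0.0, seed=0):
    """Sample fixed-length fragments from reference and add random errors.

    Toy model (an assumption, not PBSIM's model): each base is hit by an
    error with probability error_rate, and the error is a substitution,
    insertion, or deletion with equal chance.
    """
    rng = random.Random(seed)  # seeded so test datasets are reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        fragment = reference[start:start + read_len]
        noisy = []
        for base in fragment:
            if rng.random() < error_rate:
                kind = rng.choice(("sub", "ins", "del"))
                if kind == "sub":
                    noisy.append(rng.choice([b for b in "ACGT" if b != base]))
                elif kind == "ins":
                    noisy.append(base)
                    noisy.append(rng.choice("ACGT"))
                # "del": the base is simply dropped
            else:
                noisy.append(base)
        reads.append("".join(noisy))
    return reads
```

Because the reference is known, any assembly built from such reads can be checked against the ground truth. A real benchmark would still want a dedicated simulator or real data.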
Thanks @DCGenomics for initiating the discussion! And yes, I think gitter or slack would be great for team communication. I am relatively new to this field. May I ask whether the goal of the project is to develop a pipeline that glues together existing de novo assembly software for bacterial genomes? Would the pipeline be assessed by its accuracy or speed (or both)? |
@JustinChu Great questions! That's also what I wanted to ask :-) |
@syhackseq2016 Unless we have performance issues where it takes more than an hour to run our code, I'd say we shouldn't worry too much about performance. We can optimize after we get a pipeline working and generating decent results. For evaluating assemblies, Quast is commonly used. We might also want to generate a dotplot or something similar to show that we at least scaffold correctly. |
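As a small illustration of the kind of contiguity metric an assembly evaluation reports, here is a minimal sketch of N50 computed from contig lengths. The function name `n50` is hypothetical and this is not Quast's code; it only follows the usual definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size).

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to half of the
        # assembly size defines N50.
        if running >= half_total:
            return length
    return 0
```

N50 says nothing about correctness on its own, which is why a dotplot or a comparison against a reference is still useful alongside it.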
@JustinChu Thanks! Just found this post under hackseq/hackseq_projects_2016#1 with some extra info about our project. |
Greetings, Great questions so far. Most of my experience is with 16S rRNA amplicons, and lately I've been focusing on reproducible pipeline practices. I have experience with GNU Make, Snakemake, and I'm eager to learn/work with docker. I'm familiar with slack and google groups for communication, but I'm open to any tool 😀. I'll arrive in town late on the 14th and would be free to meet after that. Jess |
@hochoy Though I'm in Vancouver, many members of the team are not going to be in Vancouver until right before ASHG. Alternatively, we could have those members meet up with us electronically. @syhackseq2016 Thanks for that link. It is a good resource, though it raises more questions than it answers. They seem to be bringing up some assembly algorithms (that use either pure PacBio or a hybrid of PacBio and Illumina) in the discussion, but the project description says genome closing. I think this could be resolved once the project lead (@DCGenomics) lets us know what our expected input/output is. I've noticed people are introducing themselves a bit here. My name is (unsurprisingly) Justin and I am a PhD student in the Bioinformatics Technology Lab at the Genome Sciences Centre (GSC). I work on algorithm development for sequence classification, de novo assembly and other sequence analysis tasks. Though the GSC mostly deals in Illumina sequencing, I've worked a bit with long read technology (mostly Oxford Nanopore but some PacBio as well). I work mostly in C++, R, Perl and Python. I also have experience with make and recommend make for our initial pipeline. I shamefully hadn't really heard of snakemake until now, but after looking at it I think learning it could be very useful. Justin |
With PacBio SMRT reads as input, this is what I have in mind - PacBio SMRT reads --> Hierarchical Genome Assembly Process (HGAP) with end trimming and bestn < a coverage threshold (~20X) --> minimus2 to connect contigs --> Quiver for polishing (SMRT sequencing reads and the initial de novo assembly are the inputs to Quiver) --> FGAP to close gaps --> trim one of the two self-similar ends of each contig, owing to the circular nature of bacterial plasmids and genomes --> Quast to evaluate the assembly. 'Gepard' dotplotting tool for dotplots. Going with @JustinChu's suggestion, we could use 'make' for our pipeline. Related papers and materials to look into - BTW, my name is Hamza Khan and I am an MSc student in the CIHR Bioinformatics training program at UBC. I work at the Bioinformatics Technology Lab (BTL) at the Genome Sciences Centre in Vancouver on designing and implementing algorithms for sequence analysis. I choose Python over other languages for my day-to-day work, though most of my projects are in C++. I use R for plotting/data visualisation. Cheers! |
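The circular-trimming step in the pipeline above can be sketched roughly as follows. This is a minimal illustration that assumes the two ends of the contig overlap exactly; real contig ends carry sequencing errors, so an actual pipeline would need an alignment-based overlap check. The function name `trim_circular_overlap` and the `min_overlap` parameter are hypothetical.

```python
def trim_circular_overlap(seq, min_overlap=5):
    """Remove the longest suffix of seq that exactly repeats its prefix.

    A linear contig from a circular molecule often ends with a copy of
    its own start. Dropping that copy leaves one full turn of the circle.
    Exact matching only; this is a simplification.
    """
    max_len = len(seq) // 2
    # Try the longest candidate overlap first so that we trim as much
    # redundant sequence as possible.
    for k in range(max_len, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return seq[:-k]
    return seq  # no overlap of at least min_overlap found
```

For example, a contig built as one full turn of a plasmid followed by a repeat of its first 12 bases would come back with the repeat removed, while a contig without such an overlap is returned unchanged.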
Good evening team! Your enthusiasm is awesome, and it makes me think there is a high likelihood of not only finishing a useful software product, but also sending out a cool manuscript. That said, in my opinion, what makes a hackathon like this strong is a bunch of diverse opinions pushing and pulling to create a software product that turns clever algorithms into an easy-to-use pipeline covering as many use cases as possible. I suspect that about half of us could take three days and hack something together that would work well for a few use cases, but I would be willing to bet a Dom Pérignon to a Miller Lite that what we can come up with collectively could be much better. The original use case was "I am a biologist and I have tried to close a bacterial genome with some short reads. It didn't work (repeats, high GC, etc.), and I'm thinking of sending it out for long read sequencing, but I want to know that when I get the reads back I can reassemble with high accuracy". That said, there are a ton of caveats about which tools to use for this. Here's a specific example: say those long reads came from a core facility that has a PacBio. Depending on how much PacBio data one got back, one might consider different assemblers. Obviously, there are folks on our team with a ton of experience in this space. Also, the use case above isn't the only one in this space. There are lots of other use cases centered around bacterial [meta]genome assembly with short and long reads. Taken collectively, my opinion is that this should be fairly technically straightforward. We can likely make an outline the first morning and be hammering the crap out of Amazon's servers mid-day 2. What would make this software great is to make it as useful as possible. 
So, what I propose is that our homework for the next two weeks is to go and talk to our friends, see what they want to do in this space, and use that to build a master spec sheet, such that more people will use and hopefully contribute to this repo. Cheers! PS -- I'm getting some real sample data from a colleague at a place that does some incredible sequencing work. Perhaps other folks have people who would like to give them data for this effort? |
I've created a Slack at hackseq.slack.com, created a channel |
Hi, David. I've sent you another invitation. The first invitation was sent to your e-mail address at prostatecentre.com |
Thanks, Shaun! |
(also, could you tell me which email address you used for me?) Cheers! Ben |
I sent the invitation to you at nih.gov |
Hi, if anyone is not on the slack channel yet, please jump on! I've set up a couple of planning docs, and it would be awesome to get a list of things we are going to want on the AWS nodes in advance (there's already a bunch of stuff on this string)! You guys are awesome! Ben |
Pilon is popular. It's intended to correct variants. KAT compares the k-mer histogram of the reads to that of the assembly. "REAPR is a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison." RAMPART is an assembly pipeline that includes an evaluation step. |
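To illustrate the idea behind comparing read k-mers with assembly k-mers, here is a toy sketch. It is not KAT's implementation or interface; the function names and the `min_read_count` threshold are assumptions chosen for illustration. It counts canonical k-mers (a k-mer and its reverse complement are treated as the same) and reports k-mers that are well supported by reads but absent from the assembly, and k-mers in the assembly that no read supports.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequences, k):
    """Count canonical k-mers across an iterable of DNA strings."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[canonical(seq[i:i + k])] += 1
    return counts


def compare_spectra(reads, contigs, k, min_read_count=2):
    """Return (missing, unsupported) sets of canonical k-mers.

    missing:     seen at least min_read_count times in reads, never in contigs
    unsupported: present in contigs, never seen in reads
    """
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(contigs, k)
    missing = {
        kmer for kmer, count in read_counts.items()
        if count >= min_read_count and kmer not in asm_counts
    }
    unsupported = {kmer for kmer in asm_counts if kmer not in read_counts}
    return missing, unsupported
```

A large `missing` set hints at sequence that was left out of the assembly, while a large `unsupported` set hints at errors introduced by it. With noisy long reads the threshold and the choice of k would need care, so treat this only as a way to see the principle.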
For those of you who are new to this space, perhaps start here:
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads
For those who are experienced, perhaps mention other relevant tools in this string?
I know this is a bit simplistic, but I would like to get the discussion going!
Also, I'll likely set up a gitter or google group for us to pass info in the next few days. Comments?
Cheers!