SeqFindR - easily create informative genomic feature plots.
SeqFindR is nearing a stable API. From release 0.2, SeqFindR will primarily undergo bug fixes and feature enhancement.
We have only tested SeqFindR on linux systems. There has been some success with MacOSX.
Please see the changelog for most recent changes/fixes/enhancements.
- You'll need to install/have installed:
- ncbiblast >= 2.2.27
- python >= 2.7 (Python 3 is not supported)
- pip
- Freetype (font engine). May be required to build Matplotlib.
You can check these are installed by:
$ python --version $ which blastn $ which pip
The following python libraries will be (may be) installed automatically if you follow the installation instructions detailed below.
- We use the following python libraries:
- numpy >= 1.6.1
- scipy >= 0.10.1
- matplotlib >= 1.1.0
- biopython >= 1.59
- ghalton>=0.6
The state of python packaging is that bad you could miss many nights sleep. I'm looking at you SciPy. For the smoothest possible install we recommend installing the requirements using your distributions package manager.
For Ubuntu:
sudo apt-get install python-numpy python-scipy python-matplotlib python-biopython ncbi-blast+ python-dev python-pip
If you're a member of the Beatson Group you'll already have the SeqFindR script in your barrine $PATH. You do not need to install SeqFindR. UQ based researchers should email me ([email protected]) for the location of SeqFindR.
Option 1a (with root/admin):
$ pip install SeqFindR
Option 1b (as a standard user):
$ pip install SeqFindR --user
You'll need to have git installed for the following alternative install options. git can be really useful for scientists. See here for some discussion.
Option 2a (with root/admin & git):
$ cd ~/ $ git clone git://github.com/mscook/SeqFindR.git $ cd SeqFindR $ sudo python setup.py install
Option 2b (standard user & git) replacing INSTALL/HERE with appropriate:
$ cd ~/ $ git clone git://github.com/mscook/SeqFindR.git $ cd SeqFindR $ echo 'export PYTHONPATH=$PYTHONPATH:~/INSTALL/HERE/lib/python2.7/site-packages' >> ~/.bashrc $ echo 'export PATH=$PATH:~/INSTALL/HERE/bin' >> ~/.bashrc $ source ~/.bashrc $ python setup.py install --prefix=~/INSTALL/HERE/SeqFindR/
If the install went correctly:
$ which SeqFindR /INSTALLED/HERE/bin/SeqFindR $ SeqFindR -h
Please regularly check back to make sure you're running the most recent SeqFindR version. You can upgrade like this:
If installed using option 1x:
$ pip install --upgrade SeqFindR $ # or $ pip install --upgrade SeqFindR --user
If installed using option 2x:
$ cd ~/SeqFindR $ git pull $ sudo python setup.py install $ or $ cd ~/SeqFindR $ git pull $ echo 'export PYTHONPATH=$PYTHONPATH:~/INSTALL/HERE/lib/python2.7/site-packages' >> ~/.bashrc $ echo 'export PATH=$PATH:~/INSTALL/HERE/bin' >> ~/.bashrc $ source ~/.bashrc $ python setup.py install --prefix=~/INSTALL/HERE/SeqFindR/
SeqFindR CU fimbriae genes image. 110 E. coli strains were investigated. Order is according to phylogenetic analysis. Black blocks represent gene presence.
The SeqFindR database is in multi-fasta format. The header needs to be formatted with 4 comma separated elements.
- The elements headers are:
- identifier,
- common name,
- description and
- species
The final element, separated by [] contains a classification.
An example:
>70-tem8674, bla-TEM, Beta-lactams Antibiotic resistance (ampicillin), Unknown sp. [Beta-lactams] AAAGTTCTGCTATGTGGCGCGGTATTATCCCGTGTTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATAC >70-shv86, bla-SHV, Beta-lactams Antibiotic resistance (ampicillin), Unknown sp. [Beta-lactams] CTCAAGCGGCTGCGGGCTGGCGTGTACCGCCAGCGGCAGGGTGGCTAACAGGGAGATAATACACAGGCGA >70-oxa(1)256, bla-OXA-1, Beta-lactams Antibiotic resistance (ampicillin), Unknown sp. [Beta-lactams] >70-tetB190, tet(B), Tetracycline Antibiotic resistance (tetracycline), Unknown sp. [Tetracycline] CAAAGTGGTTAGCGATATCTTCCGAAGCAATAAATTCACGTAATAACGTTGGCAAGACTGGCATGATAAG
The script vfdb_to_seqfindr is now included in SeqFindR to convert VFDB formatted files (or like) to SeqFindR formatted database files.
VFDB: Virulence Factors Database (www.mgc.ac.cn/VFs/) is a reference database for bacterial virulence factors.
At this stage we have tested this script on limited internal datasets. Success/mileage will depend on the consistency of the VFDB formatting.
Example usage of vfdb_to_seqfindr:
# Default (will set VFDB classification identifiers as the classification) $ vfdb_to_seqfindr -i TOTAL_Strep_VFs.fas -o TOTAL_Strep_VFs.sqf # Sets any classification to blank ([ ]) $ vfdb_to_seqfindr -i TOTAL_Strep_VFs.fas -o TOTAL_Strep_VFs.sqf -b # Reads a user defined classification. 1 per in same order as input # sequences $ python convert_vfdb_to_SeqFindR.py -i TOTAL_Strep_VFs.fas -o TOTAL_Strep_VFs.sqf -c user.class
The -c (--class_file) option is very useful. Suppose you want to annotate your sequences of interest with user defined classification values. Simply develop a file containing the scheme as pass using the -c option (3rd example above). A sample file for the situation where you had 7 input sequences with the first 3 Fe transporters, the next two Toxins, the next a Misc and the final sequence is a Toxin would look like this:
Fe transporter Fe transporter Fe transporter Toxin Toxin Misc Toxin
We use the following calculation:
hsp.identities/float(record.query_length) >= tol
- Where:
- hsp.identities is number of identities in the high-scoring pairs between the query (databse entry) and subject (contig/scaffold/mapping consensus),
- record.query_length is the length of the database entry and,
- tol is the cutoff threshold to accept a hit (0.95 default)
- Why not just use max identity?
- Eliminate effects of scaffolding characters/gaps,
- Handle poor coverage etc. in mapping consensuses where N characters/gaps may be introduced
What problems may this approach cause? I'm still looking into it...
SeqFindR can read a configuartion file. At the moment you can only redefine the category colors (suppose you want to use a set of fixed colors instead of the deault randomly generated). The configuration file is expected to expand in the future.
To define category colors:
touch ~/.SeqFindR.cfg vi ~/.SeqFindR.cfg # Add something like category_colors = [(100,60,201), (255,0,99)]
Category colors can be any RGB triplet. You could use a tool similar to this one: http://www.colorschemer.com/online.html
For example the first row of colors in RGB is: (51,102,255), (102,51,255), (204,51,255), (255,51,204)
In some cases you may want to screen using PCR primers. Please use the --short option. Here we adjust BLASTn paramaters wordsize = 7 & Expect Value = 1000
Note: The version 0.2 API has changed. Please fimilarise yourself with the changes. We also now provide a script to run all the examples. Note: We have changed the color generation code. As a consequence the backgound colors will be different when running this yourself. The results will not change.
- Navigate to the SeqFindR/example directory (from git clone). The following files should be present:
- A database file called Antibiotic_markers.fa
- A ordering file called dummy.order (-i option)
- An assemblies directory containing strain1.fa, strain2.fa and strain3.fa
- A consensus directory containing strain1.fa, strain2.fa and strain3.fa (-m option)
- The toy assemblies and consesuses were generated such that:
- strain1 was missing: 70-shv86, 70-ctx143 and 70-aac3(IV)380 with mis-assembly of 70-aphA(1)1310 & 70-tem8674
- strain2 was missing: 70-oxa(7)295, 70-pse(4)348 70-ctx143, 70-aadA1588, 70-aadB1778 and 70-aacC(2)200
- strain2 was missing 70-shv86, 70-ctx143 and 70-aac3(IV)380 with mis-assembly of 70-aphA(1)1310, 70-tem8674 and 70-aadA1588
Something like this:
$ #Assuming you git cloned, python setup.py install $ cd SeqFindR/example $ ./run_examples.sh $ #See directories run1/ run2/ run3/ run4/
Command:
SeqFindR Antibiotic_markers.fa assemblies/ -o run1 -l
Link to full size run1.
Command:
SeqFindR Antibiotic_markers.fa assemblies/ -m consensus/ -o run2 -l
Link to full size run2.
Command:
SeqFindR Antibiotic_markers.fa assemblies/ -m consensus/ -o run3 -l -r
Link to full size run3.
The clustering dendrogram looks like this:
Link to full size dendrogram.
Command:
SeqFindR Antibiotic_markers.fa assemblies/ -m consensus/ -o run4 -l -r --index_file dummy.order
Link to full size run4.
We strongly recommend that you use mapping consensus data. It minimises the effects of missassembly and collapsed repeats.
We use Nesoni. We use the database file (in multi-fasta format) as the reference for mapping. Nesoni has no issues with multifasta files as references (BWA will treat them as separate chomosomes). The workflow is something like this:
$ nesoni make-reference myref ref-sequences.fa $ # for each strain $ nesoni analyse-sample: mysample myref pairs: reads1.fastq reads2.fastq
For those of you using a cluster running PBSPro see: https://github.com/mscook/SeqFindR_nesoni This is a script that generates a job array, submits and cleans up the maping results ready for input to SeqFindR.
The output from the described workflow and SeqFindR_nesoni is a consensus.fa file which we term the mapping consensus. This file is a multi-fasta file of the consensus base calls relative to the database sequences.
- Caveats:
- you will probably want to allow multi-mapping reads (giving --monogamous no --random yes to nesoni consensus) (this is default for SeqFindR_nesoni),
- The (poor) alignment of reads at the start and the end of the database genes can result in N calls. This can result in downstream false negatives.
As of version 0.2 of SeqFindR we now provide a solution to minimise the effects of poor mapping at the start and end of the given sequences.
The SeqFindR option is -s or --STRIP:
-s STRIP, --strip STRIP Strip the 1st and last N bases of mapping consensuses & database [default = 10]
By default this strips the 1st and last 10 bases from the mapping consensuses. We have had good results with this value. Feel free to experiment with different values (say, -s 0, -s 5, -s 10, -s 15). Please see image-compare a script we developed to compare the effects of different values of -s on the resultant figures.
See the help listing. You can get this yourself with:
$ SeqFindR -h
Please see the TODO for future SeqFindR project directions.