References
-
the orginal DESeq publication by Anders and Huber
-
DESeq documentation on Biocoductor website
Note that in the future we are likely to transition to DESeq2
R --vanilla --args --table=example.tab --header="YES" --rownamesCol=1 --treatCountCols=2,3 --contrCountCols=4,5 --outfiBaseTab=example --outfiBasePNG=example --filterNoPolya "YES" < DESEQ/deseq.r
The example input file example.tab and
as well as the output files are saved in Data/
directory of this repository.
INPUT The top of the input file with counts of reads per gene in from four experiments (two control replicates and two treatment replicates) example.tab file:
$ head example.tab
Gene cont_rep1 cont_rep2 treat_rep1 treat_rep2
128up 1742 1675 1603 312
14-3-3epsilon 5164 5478 5885 940
14-3-3zeta 250 575 481 129
140up 223 282 219 54
18w 27 25 65 13
26-29-p 9634 9026 10666 1750
2mit 0 0 4 0
312 123 264 230 66
4EHP 52 50 70 10
OUTPUT Following four files are generated:
- example.deseq.Results_all.tab
this is a standard DESeq ouput table. Note how rows with
Inf
andNA
values are skipped usingcat
andgrep
for viewing the table
$ cat example.deseq.Results_all.tab | grep -v -P '\tInf\t' | grep -v -P '\tNA\t' | head
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
CG12655 5.52461827927325 0.310844928378475 10.738391630168 34.5458157743957 5.11043907533905 0.209850294873236 1
MtnC 16.4025209776703 0.932534785135426 31.8725071702052 34.1783573956188 5.09501115795936 0.475634174481844 1
CG4630 5.0114623846732 0.310844928378475 9.71207984096792 31.2441315727139 4.96551333599042 0.246396570329004 1
Amy-d 21.3660182286171 1.75309866697042 40.9789377902638 23.3751462837404 4.54690348842952 0.321955305030729 1
Sdic4 3.638867490771 0.310844928378475 6.96689005316353 22.41275123743 4.48624785007082 0.529797611756606 1
CG17192 3.57033394671122 0.310844928378475 6.82982296504396 21.9718011828978 4.45758123734407 0.584648893407203 1
CG5778 3.57033394671122 0.310844928378475 6.82982296504396 21.9718011828978 4.45758123734407 0.584648893407203 1
CG4363 18.7099759644333 1.75309866697042 35.6668532618962 20.3450347284407 4.34660483794369 0.333426563456943 1
CG30108 6.50521297073474 0.621689856756951 12.3887360847125 19.9275184403658 4.31669015846702 0.237003124886577 1
-
example.deseq.dispersion.png plot of disperition estimates
-
example.deseq.diffexpres.png plot of differential expression estimates
-
example.deseq.Results_all.tab.SKIPPED_IDS file with gene names that were sckipped from the analysis
$ head example.deseq.Results_all.tab.SKIPPED_IDS
mtRNApol
5.8SrRNA
LSU-rRNA_Dme|Protostomia
LSU-rRNA_Hsa|Metazoa
SSU-rRNA_Dme|Protostomia
SSU-rRNA_Hsa|Metazoa
l(2)not
l(3)neo18
l(3)neo38
l(3)neo43
R --vanilla --args --table=<file.tab>
--header=<YES|NO>
--rownamesCol=<N>
--fitType=<paramteric|local>
--method=<pooled|blind>
--sharingMode=<maximum|fit-only>
--treatCountCols=<1-based column numbers, csv>
--contrCountCols=<1-based column numbers, csv>
--outfiBaseTAB=<path/nameBase>
--outfiBasePDF=<path/nameBase>
--filterNoPolya
--table : file with first column as row names and two columns
providing counts (values will be rounded to integers)
--header : "YES" or "NO" (default "NO"; header isn't important
as such, only important for correct reading of the table)
--rownamesCol : column number for the column that provides rownames
--treatCountCols : <1-based column numbers, csv>
--contrCountCols : <1-based column numbers, csv>
--method : "pooled" or "blind" method of fitting dispersion model.
Use "pooled" if you have replicates, "blind" if you
do not have replicates. Default "pooled"
--sharingMode : "maximum" or "fit-only" mode of fitting dispersion
model. Use "maximum" if you have replicates,
"fit-only" if you do not have replicates. Default: "maximum"
--fitType : "parametric" or "local" mode of fitting dispersion model.
"parametric" (default) is recommended. But it doesn't
work sometimes for unknown reasons, suggested alternative
"local". Default: "parametric"
--outfiBaseTAB : path/nameBase of for the output files:
${base}.deseq.Results_all.tab
--outfiBasePNG : path/nameBase of for output PNGs. Two will be written:
${base}.deseq.dispersion.png
${base}.deseq.diffexpres.png
--filterNoPolya : set to "YES" if to remove certain gene names from the
table (those that are matched by "toFilter") by reg exp
matching to gene names:
"tRNA" "rRNA" "\\)n"
"4.5S" "5S" "7S" "_rich",
"\\|U\\d+" "snoRNA" "mir-"
NOTE: more filtering is currently commented out inside of the script: rows
that have 0-value in all columns are not included into the analysis. Use of these
filtering steps will results with Inf
and NA
values
Thanks to Junho Hur for providing the data file example.tab