Testing environment #14

dkoslicki · 2020-02-06T20:44:15Z

Current tests are end-to-end integration tests that make sure scripts execute successfully. Much more testing could be done, including:

add unit tests to the tests folder (lots can be copied from CMash/MinHash.py)
add additional unit tests, as code coverage is rather low at this point
test validity of Query module results
automate script tests so manual checking is no longer required
brute-force calculate containment indices and check that results of all scripts are within acceptable error ranges
add tests for the multiple k-mer size features in test_scripts
set up Travis CI
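
For the brute-force item in the list above, here is a minimal sketch of what an exact containment computation could look like. It assumes plain Python strings instead of FASTA files, and the helper names `kmer_set` and `containment_index` are hypothetical (they are not part of CMash).

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment_index(query, reference, k):
    """Exact fraction of the query's distinct k-mers that also occur in the reference."""
    q = kmer_set(query, k)
    r = kmer_set(reference, k)
    if not q:
        raise ValueError("query is shorter than k")
    return len(q & r) / len(q)


# Toy sequences, chosen only for illustration.
reference = "ACGTACGTTAGC"
query = "ACGTTAGG"
print(containment_index(query, reference, 4))  # 4 of the 5 query 4-mers are shared
```

A script test could then compare each script's estimate against this exact value and fail when the gap exceeds a chosen tolerance.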


dkoslicki · 2020-03-23T22:01:04Z

Some unit tests are in. The MinHash module tests check validity of results. The Query module tests only really check for code-breaking errors at this point, as there are a lot of FIXMEs and TODOs.

Will need to:

test validity of Query module results
automate script tests so manual checking is no longer required.
brute-force calculate containment indices and check that results of all scripts are within acceptable error ranges.

Will be tagging this as help wanted and assigning everyone, since all are welcome to contribute.

SOP: create new branch:

git checkout master
git pull origin master  # make sure code is up to date
git checkout -b <some_feature_branch_name>  # create a new branch implementing a new testing feature
# add your new feature
git commit -a  # commit your contributions
git push origin <some_feature_branch_name>  # push your changes to your feature branch
# then request a code review before merging to master

dkoslicki · 2020-03-23T22:02:41Z

Note: while I assigned everyone, this is mainly a QOL (quality of life) issue: things that will make future contributions easier, but that should not distract from main projects, i.e. work on it as time permits.


dkoslicki · 2020-03-27T05:30:08Z

@dkoslicki Make sure GroundTruth.py treats a k-mer and its reverse complement as the same k-mer, rather than counting them as distinct.
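
One common way to identify a k-mer with its reverse complement is to store only a canonical representative, for example the lexicographically smaller of the two. The sketch below assumes that convention and an ACGT-only alphabet; the function names are hypothetical and this is not necessarily how GroundTruth.py does it.

```python
# Translation table mapping each base to its complement (ACGT only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmer_set(seq, k):
    """Set of canonical k-mers of seq, so both strands give the same set."""
    seq = seq.upper()
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}


seq = "ACGGTCA"
# A sequence and its reverse complement yield identical canonical k-mer sets.
print(canonical_kmer_set(seq, 3) == canonical_kmer_set(reverse_complement(seq), 3))
```

With this in place, a k-mer and its reverse complement are never counted as two distinct set elements.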

dkoslicki · 2020-03-27T05:52:38Z

./run_small_tests.sh
,k=10,k=12,k=14,k=16,k=18,k=20
taxid_1192839_4_genomic.fna.gz,1.0,1.0,1.0,1.0,1.0,1.0
taxid_28901_877_genomic.fna.gz,1.0,0.786,0.416,0.332,0.294,0.274

Ground truth (run on the server, since it takes a fair bit of memory):
import CMash.GroundTruth as G
query_file="/data/dmk333/repos/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
training_file="/data/dmk333/repos/CMash/tests/script_tests/TrainingDatabase.h5"
g = G.TrueContainment(training_file, "10-21-2")
df = g.return_containment_data_frame(query_file, -1, .1)
print(df)
k=10 k=12 k=14 k=16 k=18 k=20
taxid_1192839_4_genomic.fna.gz 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
taxid_28901_877_genomic.fna.gz 0.970794 0.648166 0.404911 0.336364 0.303958 0.279067

Well that looks pretty nice to me!
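
To put a number on "pretty nice", the two outputs above can be compared directly. This sketch uses only the values shown for taxid_28901_877_genomic.fna.gz and reports the absolute gap per k-mer size; the variable names are illustrative.

```python
k_sizes = [10, 12, 14, 16, 18, 20]

# Values copied from the two outputs above for taxid_28901_877_genomic.fna.gz.
estimate = [1.0, 0.786, 0.416, 0.332, 0.294, 0.274]                  # run_small_tests.sh
truth = [0.970794, 0.648166, 0.404911, 0.336364, 0.303958, 0.279067]  # GroundTruth

# Absolute difference between ground truth and estimate at each k.
abs_error = {k: abs(t - e) for k, t, e in zip(k_sizes, truth, estimate)}
worst_k = max(abs_error, key=abs_error.get)

for k in k_sizes:
    print(f"k={k}: |true - estimate| = {abs_error[k]:.6f}")
print("largest gap at k =", worst_k)
```

For this genome the largest gap is at k=12, with the other sizes agreeing much more closely.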

dkoslicki · 2020-03-27T15:54:04Z

Switched to canonical k-mers to sanity check things; results are basically unchanged:
Ground truth (run on the server, since it takes a fair bit of memory):
k=10 k=12 k=14 k=16 k=18 k=20
taxid_1192839_4_genomic.fna.gz 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000
taxid_28901_877_genomic.fna.gz 0.970735 0.64816 0.404912 0.336364 0.303959 0.279068

So we'll be sticking with canonical k-mers for the ground truth as it's much more straightforward to understand.


dkoslicki · 2020-03-27T20:41:16Z

Note to self @dkoslicki: something odd is happening at small k-mer sizes: using run_comparison_to_ground_truth.sh via GroundTruth.py, in __return_containment_index:

return len(set1.intersection(set2)) / float(len(set1))

seems correct, but

return len(set1.intersection(set2)) / float(len(set2))

returns accurate results at small k-mer sizes...
e.g.

import CMash.GroundTruth as G
training_database_file = "/home/dkoslicki/Desktop/CMash/tests/script_tests/TrainingDatabase.h5"
query_file1 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
query_file2 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_562_8705_genomic.fna.gz"
g = G.TrueContainment(training_database_file, "4-6-1")
len(g.training_file_to_ksize_to_kmers[query_file1][4].intersection(g.training_file_to_ksize_to_kmers[query_file2][4]))/float(len(g.training_file_to_ksize_to_kmers[query_file1][4]))
1.0
len(g.training_file_to_ksize_to_kmers[query_file1][4].intersection(g.training_file_to_ksize_to_kmers[query_file2][4]))/float(len(g.training_file_to_ksize_to_kmers[query_file2][4]))
0.3056179775280899

And StreamingQueryDNADatabase.py is returning a 1 (not the 0.3056).
Clearly, query_file2 is basically three copies of query_file1 at k=4, but why are the results OK at higher k-mer sizes?
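
The two denominators answer different questions, which a toy example makes concrete. The sketch below uses made-up k-mer sets (not the genomes above) in which one set is entirely contained in a larger one, then prints how many distinct k-mers exist over ACGT for a few sizes. It is an illustration of the asymmetry, not a diagnosis of the behaviour seen here.

```python
def containment(set1, set2):
    """Fraction of set1's elements that also appear in set2 (set1 as denominator)."""
    return len(set1 & set2) / float(len(set1))


# Toy stand-ins for two k-mer sets: every element of `small` is also in `large`.
small = {"AAAA", "AAAC", "AACG"}
large = small | {"ACGT", "CGTA", "GTAC", "TACG", "TTTT", "GGGG"}

print(containment(small, large))  # all of `small` is found in `large`
print(containment(large, small))  # only a third of `large` is found in `small`

# Over the ACGT alphabet there are only 4**k distinct k-mers, so the space
# is tiny at small k and grows very quickly with k.
for k in (4, 6, 8, 12):
    print(f"k={k}: {4 ** k} possible k-mers")
```

When one set is a subset of the other, the set1-denominator form returns 1 while the other direction reduces to the ratio of the set sizes. With only 256 possible 4-mers, long sequences tend to contain most of them, so this situation is much more likely at small k.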

Oh yeah, and StreamingQueryDNADatabase.py uses a heck of a lot of memory for small k-mer sizes. Probably khmer or screed's fault, but that's TBD.


dkoslicki · 2020-03-27T21:56:15Z

Regarding direction of containment, I think the committed way is best:
set1 as denominator:

Total error per k-mer size:
k=8     0.043016
k=10    0.354925
k=12    2.485572
k=14    0.690597
k=16    0.161794
k=18    0.076439
k=20    0.035385
k=22    0.008816
dtype: float64

set2 as denominator:

Total error per k-mer size:
k=8     0.168598
k=10    2.173376
k=12    3.924027
k=14    0.832073
k=16    0.140583
k=18    0.047191
k=20    0.009990
k=22    0.018207
dtype: float64

But clearly something is up with k=12. Odd...
This is using run_comparison_to_ground_truth.sh with:

testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="8-${maxK}-2"
numHashes=10000
containmentThresh=0
locationOfThresh=-1

|true-CMash|:

genome	k=8	k=10	k=12	k=14	k=16	k=18	k=20	k=22
taxid_1192839_4_genomic.fna.gz	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000e+00	0.000000e+00
taxid_1307_414_genomic.fna.gz	0.000761	0.049504	0.257595	0.048064	0.003751	0.000108	2.923078e-07	3.920249e-05
taxid_1311_236_genomic.fna.gz	0.001250	0.050466	0.278542	0.050275	0.003800	0.000344	7.373714e-05	2.379639e-05
taxid_1759312_genomic.fna.gz	0.000639	0.034805	0.260666	0.058385	0.005820	0.000915	2.118953e-04	1.701001e-04
taxid_2026799_87_genomic.fna.gz	0.000761	0.045469	0.262687	0.055671	0.004839	0.000117	9.684609e-06	5.321341e-05
taxid_2041488_genomic.fna.gz	0.000067	0.024380	0.216208	0.039611	0.003973	0.000272	7.260736e-05	1.380767e-04
taxid_28901_877_genomic.fna.gz	0.000608	0.029265	0.257055	0.151288	0.081336	0.052041	2.663244e-02	6.756324e-03
taxid_554168_genomic.fna.gz	0.001172	0.043717	0.304057	0.059005	0.005468	0.000548	1.954086e-05	8.476430e-07
taxid_562_8705_genomic.fna.gz	0.027607	0.039054	0.315867	0.110230	0.026908	0.012463	4.603500e-03	1.080697e-03
taxid_573_36_genomic.fna.gz	0.010152	0.038264	0.332896	0.118068	0.025898	0.009632	3.761221e-03	5.540108e-04
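
The "Total error per k-mer size" figures appear to be the column sums of this table. As a quick check, the sketch below sums the k=8 and k=12 columns (values copied from the rows above, in the same order), which reproduces the set1-as-denominator totals of 0.043016 and 2.485572 up to rounding of the displayed values.

```python
# |true - CMash| values for two columns of the table above, one entry per genome.
abs_diff = {
    "k=8": [0.000000, 0.000761, 0.001250, 0.000639, 0.000761,
            0.000067, 0.000608, 0.001172, 0.027607, 0.010152],
    "k=12": [0.000000, 0.257595, 0.278542, 0.260666, 0.262687,
             0.216208, 0.257055, 0.304057, 0.315867, 0.332896],
}

# Total error per k-mer size is the sum over all genomes in that column.
totals = {k: sum(values) for k, values in abs_diff.items()}

for k, total in totals.items():
    print(f"{k}: {total:.6f}")
```

The k=12 error is spread fairly evenly across the non-identical genomes (roughly 0.2 to 0.33 each), so it does not come from a single outlier.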

Now to test on a "real" metagenome...

dkoslicki · 2020-03-27T22:12:55Z

And note, the problem appears to only be at k=12:
with

testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="14-${maxK}-1"
numHashes=10000
containmentThresh=0
locationOfThresh=-1

we get


dkoslicki · 2020-04-06T19:41:25Z

Will create new issue for ground truth containment computation so it will be easier to track progress on this.

dkoslicki added the good first issue label Feb 7, 2020

This was referenced Mar 19, 2020

Figure out what to do with the reverse complements #2

Closed

Multiple k-mer sizes bug #19

Closed

Modularize StreamingQueryDNADatabase.py #24

Closed

dkoslicki added a commit that referenced this issue Mar 19, 2020

Expanded tests folder. Added unit tests. Added script tests. All test…

d77fb2a

… passing at this point. #14

dkoslicki added a commit that referenced this issue Mar 19, 2020

forgot to actually commit the unit tests for MinHash #14

9ae7c1a

dkoslicki added a commit that referenced this issue Mar 19, 2020

add skeleton for unit tests of Query.py module #14

1eb10d3

dkoslicki added a commit that referenced this issue Mar 20, 2020

making progress on unit tests for Query. #14 already identified bug i…

57f346d

…n MakeStreamingDNADatabase.py and fixed (empty kmer issue).

dkoslicki added a commit that referenced this issue Mar 20, 2020

added a bunch of unit tests for #14. Still some concerns about order …

4915e29

…of truncating and taking rev-comps

dkoslicki added a commit that referenced this issue Mar 20, 2020

Well, this is the point of testing, something is still up with the re…

51494e0

…verse complements: seq2 and seq4 should return 1 all the way through, yet they don't. #14 #2

dkoslicki added a commit that referenced this issue Mar 20, 2020

simple enough fix! Revcomps *are* handled properly, just wasn't handl…

69db213

…ing non-completely full sketches properly. #14 #2

dkoslicki added a commit that referenced this issue Mar 20, 2020

make further todo note for #14

d945315

dkoslicki added a commit that referenced this issue Mar 20, 2020

modified small test as I had mistakenly commented out the training st…

fe786bd

…ep. #14

dkoslicki added a commit that referenced this issue Mar 20, 2020

almost forgot to incorporate new BF into the MakeStreamingPrefilter.p…

ab32572

…y. Implemented, and new test added as well #24 #14

dkoslicki added a commit that referenced this issue Mar 23, 2020

remove print statement cruft from tests for #14 in prep for merge to …

0ab6271

…master #24

dkoslicki assigned dkoslicki, ShaopengLiu1, IsaacT1123 and x-zang Mar 23, 2020

dkoslicki added a commit that referenced this issue Mar 25, 2020

create a dev folder that will describe how to do local testing, conda…

f5bbcf7

… testing, and other testing issues. Further work on #14 will happen here.

dkoslicki added a commit that referenced this issue Mar 27, 2020

start making progress in computing ground truth containment indicies …

48b45ae

…for auto-testing scripts. #14

dkoslicki referenced this issue Mar 27, 2020

parallelize the ground truth, auto-populate with protected classes to…

7555e49

… make things cleaner, prep for DataFrame export

dkoslicki added a commit that referenced this issue Mar 27, 2020

GroundTruth: everything works in serial, now to debug parallel. #14

1bf1b13

dkoslicki added a commit that referenced this issue Mar 27, 2020

private classes aren't pickle-able, so just make it protected (i.e. _…

993537b

… not __ for functions in pool.map() #14

dkoslicki added a commit that referenced this issue Mar 27, 2020

bit of local testing #14

d116a53

dkoslicki added a commit that referenced this issue Mar 27, 2020

added a note about where to do the canonical k-mers #14

92c2f2a

dkoslicki added a commit that referenced this issue Mar 27, 2020

switch to using canonical k-mers #14

1002eb5

dkoslicki added a commit that referenced this issue Mar 27, 2020

and also don't start more python processes than necessary. #14

b8d6ded

dkoslicki added a commit that referenced this issue Mar 27, 2020

Merge remote-tracking branch 'origin/tests' to include tests, readmes…

bc145d7

…, and ground truth from #14

dkoslicki mentioned this issue Mar 27, 2020

Multiple k-mer sizes confirmation and testing #20

Open

6 tasks

dkoslicki added a commit that referenced this issue Mar 27, 2020

add doc strings, bug fixes, and turn into a function too so you can c…

0a557cc

…all it just like StreamingQueryDNADatabase.py #14

dkoslicki added a commit that referenced this issue Mar 27, 2020

create script to generate estimate and ground truth outputs #14

9a94f77

dkoslicki added a commit that referenced this issue Mar 27, 2020

add a little python script on the bottom to quantify all the error #14

7d4a573

dkoslicki added a commit that referenced this issue Mar 27, 2020

replace params with realistic ones for use on the server. #14

b320bd6

dkoslicki added a commit that referenced this issue Mar 27, 2020

add test that demonstrates that #22 is a real issue. #14

87d2725

dkoslicki added a commit that referenced this issue Mar 27, 2020

fixed issue #22 via #14

c4427a9

dkoslicki added a commit that referenced this issue Mar 27, 2020

Merge remote-tracking branch 'origin/tests' to pick up changes from #14…

9881dc5

… and Fixes #22
