Testing environment #14

Open
3 of 7 tasks
dkoslicki opened this issue Feb 6, 2020 · 9 comments

Comments

@dkoslicki
Owner

dkoslicki commented Feb 6, 2020

Current tests are end-to-end integration tests that make sure the scripts execute successfully. Much more testing could be done, including:

  • add unit tests to the tests folder (many can be copied from CMash/MinHash.py)
  • add further unit tests, as code coverage is rather low at this point
  • test validity of Query module results
  • automate script tests so manual checking is no longer required
  • brute-force calculate containment indices and check that the results of all scripts are within acceptable error ranges
  • add tests for the multiple k-mer size features in test_scripts
  • set up Travis CI
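
As a rough illustration of the brute-force check in the list above, a containment index can be computed exactly from Python sets and compared with an estimate under a tolerance. This is only a sketch and not CMash code: `kmer_set`, `containment_index`, the toy sequences, the stand-in estimate, and the tolerance are all hypothetical.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment_index(query_kmers, reference_kmers):
    """Fraction of the query's k-mers that also occur in the reference."""
    if not query_kmers:
        # Guard against sequences shorter than k (no k-mers at all).
        raise ValueError("query has no k-mers at this k")
    return len(query_kmers & reference_kmers) / len(query_kmers)


# Toy data (hypothetical): the query is a substring of the reference.
reference = "ACGTACGTTGCAAGCT"
query = "CGTACGTT"
k = 4

truth = containment_index(kmer_set(query, k), kmer_set(reference, k))

# In a real test the estimate would come from one of the scripts;
# here it is a hard-coded stand-in.
estimate = 0.98
tolerance = 0.05  # hypothetical acceptable error
assert abs(truth - estimate) <= tolerance
```

A real test would replace the stand-in estimate with the value parsed from a script's output and pick the tolerance per k-mer size.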
dkoslicki added a commit that referenced this issue Mar 19, 2020
dkoslicki added a commit that referenced this issue Mar 20, 2020
…n MakeStreamingDNADatabase.py and fixed (empty kmer issue).
dkoslicki added a commit that referenced this issue Mar 20, 2020
dkoslicki added a commit that referenced this issue Mar 20, 2020
…verse complements: seq2 and seq4 should return 1 all the way through, yet they don't. #14 #2
dkoslicki added a commit that referenced this issue Mar 20, 2020
dkoslicki added a commit that referenced this issue Mar 20, 2020
dkoslicki added a commit that referenced this issue Mar 20, 2020
@dkoslicki
Owner Author

Some unit tests are in. The MinHash module tests check the validity of results. The Query module tests only really check for code-breaking errors at this point, as there are a lot of FIXMEs and TODOs.

Will need to:

  • test validity of Query module results
  • automate script tests so manual checking is no longer required.
  • brute-force calculate containment indices and check that the results of all scripts are within acceptable error ranges.

Will be tagging this as help wanted and assigning everyone, since all are welcome to contribute.

SOP: create a new branch:

git checkout master
git pull origin master  # make sure code is up to date
git checkout -b <some_feature_branch_name>  # create a new branch implementing a new testing feature
# add your new feature
git commit -a  # commit your contributions
git push origin <some_feature_branch_name>  # push your changes to your feature branch
# then request a code review before merging to master

@dkoslicki
Owner Author

Note: while I assigned everyone, this is mainly a QOL (quality of life) issue: things that will make our future contributions easier but should not distract from the main projects, i.e. work on it as time permits.

dkoslicki added a commit that referenced this issue Mar 25, 2020
… testing, and other testing issues. Further work on #14 will happen here.
dkoslicki added a commit that referenced this issue Mar 27, 2020
dkoslicki referenced this issue Mar 27, 2020
… make things cleaner, prep for DataFrame export
dkoslicki added a commit that referenced this issue Mar 27, 2020
@dkoslicki
Owner Author

@dkoslicki Make sure GroundTruth.py identifies each k-mer with its reverse complement (rc-k-mer) rather than counting them as distinct.
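
One common way to treat a k-mer and its reverse complement as the same object is to map both to a canonical form, for example the lexicographically smaller of the two. The sketch below assumes that convention; whether GroundTruth.py uses the same one is not stated in this thread, and all function names here are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    kmer = kmer.upper()
    return min(kmer, reverse_complement(kmer))


def canonical_kmer_set(seq, k):
    """Set of canonical k-mers of seq."""
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}


# A sequence and its reverse complement give the same canonical k-mer set,
# so their containment in each other is 1 in both directions.
seq = "ACGTTGCAAG"
assert canonical_kmer_set(seq, 4) == canonical_kmer_set(reverse_complement(seq), 4)
```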

dkoslicki added a commit that referenced this issue Mar 27, 2020
@dkoslicki
Owner Author

./run_small_tests.sh
,k=10,k=12,k=14,k=16,k=18,k=20
taxid_1192839_4_genomic.fna.gz,1.0,1.0,1.0,1.0,1.0,1.0
taxid_28901_877_genomic.fna.gz,1.0,0.786,0.416,0.332,0.294,0.274

Ground truth (run on the server, since it takes a fair bit of memory):
import CMash.GroundTruth as G
query_file="/data/dmk333/repos/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
training_file="/data/dmk333/repos/CMash/tests/script_tests/TrainingDatabase.h5"
g = G.TrueContainment(training_file, "10-21-2")
df = g.return_containment_data_frame(query_file, -1, .1)
print(df)
genome                              k=10      k=12      k=14      k=16      k=18      k=20
taxid_1192839_4_genomic.fna.gz  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
taxid_28901_877_genomic.fna.gz  0.970794  0.648166  0.404911  0.336364  0.303958  0.279067

Well that looks pretty nice to me!
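
To put a number on "pretty nice", the per-k absolute difference between the script output and the ground truth can be computed directly. The sketch below uses only the taxid_28901_877_genomic.fna.gz rows printed above; the variable names are illustrative and plain lists stand in for the DataFrame.

```python
k_sizes = ["k=10", "k=12", "k=14", "k=16", "k=18", "k=20"]

# Rows for taxid_28901_877_genomic.fna.gz, copied from the two outputs above.
cmash_estimate = [1.0, 0.786, 0.416, 0.332, 0.294, 0.274]
ground_truth = [0.970794, 0.648166, 0.404911, 0.336364, 0.303958, 0.279067]

# Absolute difference |true - CMash| for each k-mer size.
abs_error = {
    k: abs(true - est)
    for k, true, est in zip(k_sizes, ground_truth, cmash_estimate)
}

for k, err in abs_error.items():
    print(f"{k}: {err:.6f}")

# k-mer size with the largest gap between estimate and ground truth.
worst_k = max(abs_error, key=abs_error.get)
print("largest gap at", worst_k)
```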

@dkoslicki
Owner Author

Switched to canonical k-mers to sanity check things, results basically unchanged:
Ground truth (run on the server, since it takes a fair bit of memory):
genome                              k=10      k=12      k=14      k=16      k=18      k=20
taxid_1192839_4_genomic.fna.gz  1.000000   1.00000  1.000000  1.000000  1.000000  1.000000
taxid_28901_877_genomic.fna.gz  0.970735   0.64816  0.404912  0.336364  0.303959  0.279068

So we'll be sticking with canonical k-mers for the ground truth as it's much more straightforward to understand.

dkoslicki added a commit that referenced this issue Mar 27, 2020
dkoslicki added a commit that referenced this issue Mar 27, 2020
…all it just like StreamingQueryDNADatabase.py #14
@dkoslicki
Owner Author

dkoslicki commented Mar 27, 2020

Note to self @dkoslicki: something odd is happening at small k-mer sizes: using run_comparison_to_ground_truth.sh via GroundTruth.py, in __return_containment_index:

return len(set1.intersection(set2)) / float(len(set1))

seems correct, but

return len(set1.intersection(set2)) / float(len(set2))

returns accurate results at small k-mer sizes...
e.g.

import CMash.GroundTruth as G
training_database_file = "/home/dkoslicki/Desktop/CMash/tests/script_tests/TrainingDatabase.h5"
query_file1 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
query_file2 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_562_8705_genomic.fna.gz"
g = G.TrueContainment(training_database_file, "4-6-1")
kmers1 = g.training_file_to_ksize_to_kmers[query_file1][4]  # 4-mers of query_file1
kmers2 = g.training_file_to_ksize_to_kmers[query_file2][4]  # 4-mers of query_file2
len(kmers1.intersection(kmers2)) / float(len(kmers1))  # set1 as denominator
1.0
len(kmers1.intersection(kmers2)) / float(len(kmers2))  # set2 as denominator
0.3056179775280899

And StreamingQueryDNADatabase.py is returning 1 (not 0.3056).
Clearly, at k=4 every k-mer of query_file1 also occurs in query_file2, whose k-mer set is roughly three times larger, but why are the results OK at higher k-mer sizes?
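
The asymmetry between the two denominators can be reproduced with toy sets: when one set is a subset of the other, containment is 1 in one direction and the ratio of the set sizes in the other. The sets below are made up purely for illustration.

```python
def containment(a, b):
    """|a & b| / |a|: the fraction of a's elements that are also in b."""
    return len(a & b) / len(a)


# Toy k-mer sets (hypothetical): small is a strict subset of large.
small = {"AAAA", "AAAC", "AACG"}
large = small | {"ACGT", "CGTA", "GTAC", "TACG", "TTTT", "GGGG", "CCCC"}

print(containment(small, large))  # 1.0
print(containment(large, small))  # 0.3, i.e. len(small) / len(large)
```

By the same reasoning, the 1.0 and 0.3056 reported above mean that the k=4 set for query_file1 is entirely contained in the set for query_file2 and is about 30.6% of its size.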

Oh yeah, and StreamingQueryDNADatabase.py uses a heck of a lot of memory for small k-mer sizes. Probably khmer or screed's fault, but that's TBD.

dkoslicki added a commit that referenced this issue Mar 27, 2020
… make sure that the installation worked correctly (i.e. don't call with python from the scripts directory, instead call the Bioconda installed version) #14
@dkoslicki
Owner Author

dkoslicki commented Mar 27, 2020

Regarding direction of containment, I think the committed way is best:
set1 as denom:

Total error per k-mer size:
k=8     0.043016
k=10    0.354925
k=12    2.485572
k=14    0.690597
k=16    0.161794
k=18    0.076439
k=20    0.035385
k=22    0.008816
dtype: float64

set2 as denom:

Total error per k-mer size:
k=8     0.168598
k=10    2.173376
k=12    3.924027
k=14    0.832073
k=16    0.140583
k=18    0.047191
k=20    0.009990
k=22    0.018207
dtype: float64

But clearly something is up with k=12. Odd...
This is using run_comparison_to_ground_truth.sh with:

testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="8-${maxK}-2"
numHashes=10000
containmentThresh=0
locationOfThresh=-1

|true-CMash|:

genome                                k=8      k=10      k=12      k=14      k=16      k=18          k=20          k=22
taxid_1192839_4_genomic.fna.gz   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000e+00  0.000000e+00
taxid_1307_414_genomic.fna.gz    0.000761  0.049504  0.257595  0.048064  0.003751  0.000108  2.923078e-07  3.920249e-05
taxid_1311_236_genomic.fna.gz    0.001250  0.050466  0.278542  0.050275  0.003800  0.000344  7.373714e-05  2.379639e-05
taxid_1759312_genomic.fna.gz     0.000639  0.034805  0.260666  0.058385  0.005820  0.000915  2.118953e-04  1.701001e-04
taxid_2026799_87_genomic.fna.gz  0.000761  0.045469  0.262687  0.055671  0.004839  0.000117  9.684609e-06  5.321341e-05
taxid_2041488_genomic.fna.gz     0.000067  0.024380  0.216208  0.039611  0.003973  0.000272  7.260736e-05  1.380767e-04
taxid_28901_877_genomic.fna.gz   0.000608  0.029265  0.257055  0.151288  0.081336  0.052041  2.663244e-02  6.756324e-03
taxid_554168_genomic.fna.gz      0.001172  0.043717  0.304057  0.059005  0.005468  0.000548  1.954086e-05  8.476430e-07
taxid_562_8705_genomic.fna.gz    0.027607  0.039054  0.315867  0.110230  0.026908  0.012463  4.603500e-03  1.080697e-03
taxid_573_36_genomic.fna.gz      0.010152  0.038264  0.332896  0.118068  0.025898  0.009632  3.761221e-03  5.540108e-04
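
As a consistency check, summing one column of the |true-CMash| table should reproduce the corresponding "Total error per k-mer size" entry. A minimal sketch for the k=12 column, using the values as printed (so agreement is only expected up to rounding):

```python
# k=12 column of the |true-CMash| table, in row order.
k12_abs_diff = [
    0.000000, 0.257595, 0.278542, 0.260666, 0.262687,
    0.216208, 0.257055, 0.304057, 0.315867, 0.332896,
]

total_error = sum(k12_abs_diff)
reported = 2.485572  # k=12 total from the "set1 as denom" listing

print(f"summed: {total_error:.6f}, reported: {reported:.6f}")

# The table is rounded to six decimals, so allow a small discrepancy.
assert abs(total_error - reported) < 1e-5
```

The match indicates that the table corresponds to the set1-as-denominator run.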

[Screenshot 2020-03-27 18 06 28]

Now to test on a "real" metagenome...

@dkoslicki
Owner Author

And note, the problem appears to only be at k=12:
with

testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="14-${maxK}-1"
numHashes=10000
containmentThresh=0
locationOfThresh=-1

we get
[Screenshot 2020-03-27 18 10 24]

@dkoslicki
Owner Author

Will create a new issue for the ground truth containment computation so that progress on it is easier to track.
