Documentation / software bug for building custom database #11

Open
taltman opened this issue Apr 23, 2020 · 1 comment

taltman commented Apr 23, 2020

Hi LMAT team!

I have been trying to follow the documentation in lmat-doc.txt to build a custom database for use with LMAT. I've been having issues doing so. I'll try to document a specific case here.

One step in the process of building a custom database is constructing a mapping file between NCBI Taxonomy Database identifiers and the full deflines from the multi-FASTA formatted file containing the reference sequences. I've followed the documentation below in that regard:

 The mapping is specified as a tab delimited file with the first column containing the tax id and the second
 column should contain the header associated with sequence stored in the input fasta file (WORK/test.fa below)
 For example:
 418127   >ref|NC_009782.1|gnl|NCBI_GENOMES|21340|gi|156978331|Staphylococcus aureus subsp. aureus Mu3, complete genome

When I provide my constructed GenomeToTaxID.txt file to build_header_table.py, it breaks:

reading: /media/ephemeral/taltman/lmat/GenomeToTaxID.txt
Traceback (most recent call last):
  File "./build_header_table.py", line 44, in <module>
    gi_to_tid[t[4]] = t[0]
IndexError: list index out of range

Poking into the Python script, it seems to be expecting a file with at least five columns, not two. Changing t[4] to t[1] seems to fix it.
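
To make the failure concrete, here is a minimal reproduction. Only the assignment on line 44 comes from the traceback; the tab split and the surrounding variable setup are my assumptions about what the script does before that line.

```python
# Hypothetical reproduction: one mapping line in the documented two-column layout.
line = (
    "418127\t>ref|NC_009782.1|gnl|NCBI_GENOMES|21340|gi|156978331|"
    "Staphylococcus aureus subsp. aureus Mu3, complete genome\n"
)

# Assumed: the script splits each line on tabs before indexing into it.
t = line.rstrip("\n").split("\t")
gi_to_tid = {}

try:
    gi_to_tid[t[4]] = t[0]  # line 44 as shipped: needs at least five columns
except IndexError as err:
    error_message = str(err)

gi_to_tid[t[1]] = t[0]  # the change that worked for me: header is column 2
```

With a two-column line, `t` has only two elements, so `t[4]` raises the `IndexError` and `t[1]` picks up the header as the documentation describes.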

So, either there is a documentation bug, or there is a software bug.

Any feedback would be greatly appreciated. Thanks!

jeallen (Member) commented Apr 23, 2020

Hello, it's possible that this script was merged with another version that used the following convoluted format:
Taxonomy id, taxonomy id, -1, otherid, header
As you can see, there is a lot of redundancy here. I don't think this is needed.

It appears to me the best option would be to make the change: gi_to_tid[t[1]] = t[0] and use your original format.
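
If both layouts need to keep working, one alternative to hard-coding the column index is to read the header from the last column, since it is last in both the two-column and the five-column format. This is only a sketch under that assumption: `parse_mapping_line` and `load_mapping` are hypothetical helpers, not functions in build_header_table.py, and the `otherid` value in the usage example is made up.

```python
def parse_mapping_line(line):
    """Return (header, tax_id) from one tab-delimited mapping line.

    Assumes the tax id is the first column and the header is the last
    column, which holds for both layouts discussed in this thread.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least 2 tab-separated columns: %r" % line)
    return fields[-1], fields[0]


def load_mapping(lines):
    """Build a header -> tax id dictionary, skipping blank lines."""
    header_to_tid = {}
    for line in lines:
        if not line.strip():
            continue
        header, tax_id = parse_mapping_line(line)
        header_to_tid[header] = tax_id
    return header_to_tid


# Usage: the same record in the two-column and the five-column layout.
header = ">ref|NC_009782.1|gnl|NCBI_GENOMES|21340|gi|156978331|Staphylococcus aureus subsp. aureus Mu3, complete genome"
two_col = "418127\t" + header + "\n"
five_col = "418127\t418127\t-1\t156978331\t" + header + "\n"  # otherid is illustrative

mapping_two = load_mapping([two_col])
mapping_five = load_mapping([five_col])
```

Both calls yield the same single-entry dictionary. The simpler `t[1]` change is still enough if only the documented two-column format is supported.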
