GTF filter error in 3.13.2 but works in 3.12.0 #1147

Closed
heathfuqua opened this issue Dec 28, 2023 · 10 comments

@heathfuqua

Description of the bug

When running zebrafish samples through the rnaseq pipeline on 3.13.2, the run fails at the GTF_FILTER job. The failure does not occur with the same settings on 3.12.0.

Command used and terminal output

nextflow run nf-core/rnaseq \
		 -r 3.13.2 \
		 -profile docker \
		 -c nextflow.config \
		 -params-file rnaseq.params.json \
		 -with-tower

Error executing process > 'NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF_FILTER (Danio_rerio.GRCz11.dna.fa)'

Caused by:
  Essential container in task exited

Command executed:

  filter_gtf.py \
      --gtf Danio_rerio.GRCz11.110.gtf \
      --fasta Danio_rerio.GRCz11.dna.fa \
      --prefix Danio_rerio.GRCz11.dna
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF_FILTER":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "//nextflow-bin/filter_gtf.py", line 73, in <module>
      filter_gtf(args.fasta, args.gtf, args.prefix + ".filtered.gtf", args.skip_transcript_id_check)
    File "//nextflow-bin/filter_gtf.py", line 32, in filter_gtf
      if tab_delimited(gtf_in) != 8:
    File "//nextflow-bin/filter_gtf.py", line 26, in tab_delimited
      data = f.read(102400)
    File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Relevant files

rnaseq.params.json

System information

Nextflow version: 23.04.3
Hardware: AWS cloud
Executor: awsbatch
Container engine: Docker
Version of nf-core/rnaseq 3.13.2

@heathfuqua heathfuqua added the bug Something isn't working label Dec 28, 2023
@drpatelh
Member

drpatelh commented Jan 3, 2024

Hi @heathfuqua! It would be great if you could give us links to where we can download the GTF and FASTA files so we can reproduce this. It looks like an encoding issue, but it would be good to confirm.

@drpatelh drpatelh added this to the 3.13.3 milestone Jan 3, 2024
@MatthiasZepper
Member

Yes, that GTF sanity check is a recent addition in version 3.13 of the pipeline, so it makes sense that version 3.12 runs without issues.

Since the decode error occurs in position 1 and the invalid byte happens to be 0x8b, I am confident that your gtf file is still compressed (gzip's magic number is 0x1f 0x8b).

Just decompress it with gzip -dc Danio_rerio.GRCz11.110.gtf > decompressed.gtf (plain gzip -d refuses input that lacks a .gz suffix) and you should be set.
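The magic-number reasoning above can be checked directly. The following is a minimal Python sketch, not part of the pipeline: `looks_gzipped` is a hypothetical helper name, and the demonstration uses a throwaway file rather than the real GTF. It writes gzip data under a plain `.gtf` name, confirms the first two bytes are `0x1f 0x8b`, and shows that reading the file as UTF-8 text fails at position 1.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Throwaway example: gzip-compressed content saved under a plain .gtf name.
with tempfile.TemporaryDirectory() as tmp:
    fake_gtf = os.path.join(tmp, "example.gtf")
    with gzip.open(fake_gtf, "wb") as handle:
        handle.write(b'1\thavana\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n')

    is_gzipped = looks_gzipped(fake_gtf)

    # Reading the same file as UTF-8 text trips over the second magic byte.
    decode_error = None
    try:
        with open(fake_gtf, encoding="utf-8") as handle:
            handle.read(102400)
    except UnicodeDecodeError as err:
        decode_error = err

print(is_gzipped)
print(decode_error)
```

Running `looks_gzipped` on the GTF that is actually staged for the failing task would confirm or rule out this explanation without relying on the file extension.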

@pinin4fjords
Member

(or rename the file to suffix it with .gz so that the pipeline recognises the GTF as compressed and uncompresses it)

@heathfuqua
Author

Thanks for your help.
That does make sense given the invalid byte and position, but I just double-checked the file (ran head on it) and it's plain text. The reason I initially unzipped it is that the gzipped version was failing with the error below. It may be worth noting that the full 3.12.0 pipeline completed with the unzipped file.

Error executing process > 'NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GUNZIP_GTF (Danio_rerio.GRCz11.110.gtf.gz)'

--

Caused by:
  Process NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GUNZIP_GTF (Danio_rerio.GRCz11.110.gtf.gz) terminated with an error exit status (1)

Command executed:

  gunzip \
      -f \
      \
      Danio_rerio.GRCz11.110.gtf.gz

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GUNZIP_GTF":
      gunzip: $(echo $(gunzip --version 2>&1) | sed 's/^.*(gzip) //; s/ Copyright.*$//')
  END_VERSIONS

Command exit status:
  1

--

I'd gotten a similar error with an unzipped fasta file. See below. This file I also verified as unzipped just now.

The exit status of the task that caused the workflow execution to fail was: 1

--

Error executing process > 'NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF_FILTER (Danio_rerio.GRCz11.dna.fa)'

Caused by:
  Process NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF_FILTER (Danio_rerio.GRCz11.dna.fa) terminated with an error exit status (1)

Command executed:

  filter_gtf.py \
      --gtf Danio_rerio.GRCz11.110.gtf \
      --fasta Danio_rerio.GRCz11.dna.fa \
      --prefix Danio_rerio.GRCz11.dna

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF_FILTER":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/mnt/jfs/nextflow/tmp/4e/db93f6b10de84f9a8faf3814e762c0/bin/filter_gtf.py", line 73, in <module>
      filter_gtf(args.fasta, args.gtf, args.prefix + ".filtered.gtf", args.skip_transcript_id_check)
    File "/mnt/jfs/nextflow/tmp/4e/db93f6b10de84f9a8faf3814e762c0/bin/filter_gtf.py", line 32, in filter_gtf
      if tab_delimited(gtf_in) != 8:
    File "/mnt/jfs/nextflow/tmp/4e/db93f6b10de84f9a8faf3814e762c0/bin/filter_gtf.py", line 26, in tab_delimited
      data = f.read(102400)
    File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

@drpatelh
Member

drpatelh commented Jan 3, 2024

Thanks @heathfuqua ! Are you able to share the file with us somehow or point us to where we can download it please?

@heathfuqua
Author

heathfuqua commented Jan 3, 2024 via email

@pinin4fjords
Member

The gunzip error is suspicious; that step should work just fine.

In any case I can't replicate that issue by downloading that GTF, gunzipping it, and running the script in a conda env. If your data are public, could you provide your sample sheet please? That will allow us to run the workflow and see if we can replicate things at that level.

@drpatelh drpatelh modified the milestones: 3.14.0, 3.15.0 Jan 4, 2024
@pinin4fjords
Member

Also:

  • If you haven't already, re-download the reference and try again (without resume); that would exclude the possibility of corruption.
  • Check the contents of the working directory for that process and verify the format of the input GTF staged there; that will exclude an issue with transfers in AWS.
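One way to carry out both checks is to compare a fresh download against the copy staged in the task's working directory. This is a rough Python sketch under stated assumptions: `describe_file` is a hypothetical helper, both copies are assumed to be reachable as local paths, and the demonstration uses two throwaway files in place of the real GTFs.

```python
import hashlib
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def describe_file(path):
    """Summarise a file so that two copies can be compared."""
    real_path = os.path.realpath(path)  # follow symlinks to the actual file
    digest = hashlib.sha256()
    with open(real_path, "rb") as handle:
        first_bytes = handle.read(2)
        digest.update(first_bytes)
        # Hash the rest in 1 MiB chunks to keep memory use small.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "real_path": real_path,
        "size_bytes": os.path.getsize(real_path),
        "gzip_magic": first_bytes == GZIP_MAGIC,
        "sha256": digest.hexdigest(),
    }


# Throwaway stand-ins for "fresh download" and "staged copy".
with tempfile.TemporaryDirectory() as tmp:
    content = b'1\thavana\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n'
    original = os.path.join(tmp, "original.gtf")
    staged = os.path.join(tmp, "staged.gtf")
    for target in (original, staged):
        with open(target, "wb") as handle:
            handle.write(content)

    original_info = describe_file(original)
    staged_info = describe_file(staged)

same_content = original_info["sha256"] == staged_info["sha256"]
print(same_content, staged_info["gzip_magic"])
```

A checksum mismatch would point to corruption or a transfer problem, while matching checksums with `gzip_magic` set would mean the source file itself is still compressed.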

@heathfuqua
Author

OK, a colleague just launched a run on 3.13.2 using the re-downloaded files and had no problems, so clearly I made an error somewhere along the way and misattributed it to a bug.
Thanks again for the help and sorry for the wasted time.

@pinin4fjords
Member

No worries, thanks for letting us know.
