You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with at least 4 columns and a header row, as shown in the examples below.
```bash
--input '[path to samplesheet file]'
```
The sample identifiers have to be the same when you have re-sequenced the same sample more than once, e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes:
```csv
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,unstranded
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,unstranded
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,unstranded
```
The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire; however, there is a strict requirement for the first 4 columns to match those defined in the table below.
A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice.
```csv
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,forward
CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,forward
CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,forward
TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,,reverse
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,,reverse
```
| Column | Description |
| --- | --- |
| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). |
| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension `.fastq.gz` or `.fq.gz`. |
| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension `.fastq.gz` or `.fq.gz`. |
| `strandedness` | Sample strand-specificity. Must be one of `unstranded`, `forward` or `reverse`. |
An example samplesheet has been provided with the pipeline.
NB: The `group` and `replicate` columns were replaced with a single `sample` column as of v3.1 of the pipeline. The `sample` column is essentially a concatenation of the `group` and `replicate` columns, however it now also offers more flexibility in instances where replicate information is not required, e.g. when sequencing clinical samples. If all values of `sample` have the same number of underscores, fields defined by these underscore-separated names may be used in the PCA plots produced by the pipeline, to regain the ability to represent different groupings.
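As a hypothetical illustration, the sample names below all contain the same number of underscores, so the underscore-separated fields (here genotype, treatment and replicate) could be recovered as groupings for the PCA plots. All sample and file names are placeholders:

```csv
sample,fastq_1,fastq_2,strandedness
WT_DRUG_REP1,WT_DRUG_REP1_R1.fastq.gz,WT_DRUG_REP1_R2.fastq.gz,unstranded
WT_VEHICLE_REP1,WT_VEHICLE_REP1_R1.fastq.gz,WT_VEHICLE_REP1_R2.fastq.gz,unstranded
KO_DRUG_REP1,KO_DRUG_REP1_R1.fastq.gz,KO_DRUG_REP1_R2.fastq.gz,unstranded
KO_VEHICLE_REP1,KO_VEHICLE_REP1_R1.fastq.gz,KO_VEHICLE_REP1_R2.fastq.gz,unstranded
```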
An executable Python script called `fastq_dir_to_samplesheet.py` has been provided if you would like to auto-create an input samplesheet based on a directory containing FastQ files before you run the pipeline (requires Python 3 installed locally), e.g.:
```bash
wget https://raw.githubusercontent.com/SpikyClip/llrnaseq/master/bin/fastq_dir_to_samplesheet.py
./fastq_dir_to_samplesheet.py <FASTQ_DIR> samplesheet.csv \
    --strandedness reverse \
    --read1_extension _1.fastq.gz \
    --read2_extension _2.fastq.gz \
    --sanitise_name \
    --sanitise_name_delimiter '_' \
    --sanitise_name_index 1
```
The read extension arguments specify the suffix pattern for read 1 and read 2 files. The `--sanitise_name` argument (default `false`) enables extraction of a sample name from the file basename. Here, it splits the basename on underscores (`_`) and returns the first field (1-based) as the sample name. If `--sanitise_name` is not provided, the sample name defaults to the file basename. The `--help` argument will show the full list of arguments.
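As a concrete illustration of the behaviour described above, using a hypothetical Illumina file name:

```bash
# Hypothetical input file:
#   CONTROL_REP1_S1_L002_R1_001.fastq.gz
# With --sanitise_name --sanitise_name_delimiter '_' --sanitise_name_index 1,
# the basename is split on '_' and the first field is kept:
#   sample name -> CONTROL
# Without --sanitise_name, the sample name would be the full basename:
#   CONTROL_REP1_S1_L002_R1_001
```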
By default, the pipeline uses the HISAT2 aligner (i.e. `--aligner hisat2`) to map the raw FastQ reads to the reference genome.
The minimum reference genome requirements are a FASTA and a GTF file. However, it is more storage- and compute-friendly to re-use reference genome files wherever possible. If you are using the pipeline to build new indices (e.g. those unavailable on AWS iGenomes), it is recommended to use the `--save_reference` parameter so that the indices are saved somewhere locally. Index building can be quite a time-consuming process, and saving the indices permits their reuse in future runs of the pipeline. You can then provide the reference genome files either on the command line via the appropriate parameters (e.g. `--hisat2_index '/path/to/hisat2/index/'`) or via a custom config file.
- If `--genome` is provided, then the FASTA and GTF files (and existing indices) will be automatically obtained from AWS iGenomes, unless these have already been downloaded locally to the path specified by `--igenomes_base`.
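As a sketch of the custom-config route, a file such as the hypothetical `custom.config` below could be passed to Nextflow with `-c custom.config` (the parameter names follow nf-core/rnaseq conventions; the paths are placeholders):

```groovy
// custom.config -- placeholder paths to locally stored reference files
params {
    fasta        = '/path/to/genome.fa'
    gtf          = '/path/to/genes.gtf'
    hisat2_index = '/path/to/hisat2/index/'
}
```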
NB: Compressed reference files are also supported by the pipeline, i.e. standard files with the `.gz` extension and index folders with the `.tar.gz` extension.
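For example, a run supplying compressed reference files directly on the command line might look like the sketch below (file names are placeholders, and `--fasta`/`--gtf` follow nf-core/rnaseq parameter conventions):

```bash
nextflow run SpikyClip/llrnaseq \
    -profile lims \
    --input samplesheet.csv \
    --fasta genome.fa.gz \
    --gtf genes.gtf.gz \
    --hisat2_index hisat2_index.tar.gz
```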
The typical command for running the pipeline on the LIMS-HPCC is as follows:
```bash
nextflow run SpikyClip/llrnaseq \
    -profile lims \
    --input samplesheet.csv \
    --genome GRCh37
```
This will launch the pipeline with the `lims` configuration profile. See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:
```console
work            # Directory containing the nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
# Other nextflow hidden files, e.g. history of pipeline runs and old logs.
```
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available, even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, regularly update the cached version:
```bash
nextflow pull SpikyClip/llrnaseq
```
NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double hyphen).
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below. When using Biocontainers, most of these software packaging methods pull Docker containers from quay.io (e.g. FastQC), except for Singularity, which directly downloads Singularity images via HTTPS hosted by the Galaxy project, and Conda, which downloads and installs software locally from Bioconda.
We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility; however, when this is not possible, Conda is also supported.
The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation.
Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.
If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is not recommended.
- `lims` - A configuration profile for the LIMS-HPCC.
- `docker` - A generic configuration profile to be used with Docker
- `singularity` - A generic configuration profile to be used with Singularity
- `podman` - A generic configuration profile to be used with Podman
- `shifter` - A generic configuration profile to be used with Shifter
- `charliecloud` - A generic configuration profile to be used with Charliecloud
- `conda` - A generic configuration profile to be used with Conda. Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
- `test` - A profile with a complete configuration for automated testing. Includes links to test data so needs no other parameters (see the example after this list).
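For example, a minimal test run using the Docker profile might look like the sketch below (the `test` profile supplies its own input data):

```bash
nextflow run SpikyClip/llrnaseq -profile test,docker
```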
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.
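For example, to restart the earlier LIMS-HPCC run from where it left off (a sketch reusing the command shown above):

```bash
nextflow run SpikyClip/llrnaseq \
    -profile lims \
    --input samplesheet.csv \
    --genome GRCh37 \
    -resume
```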
Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.
To see a full list of arguments, execute the following command:

```bash
nextflow run SpikyClip/llrnaseq --help
```
Pipeline settings can be configured on a per-user profile basis or on a per-project basis. See the main Nextflow documentation for more information about creating your own configuration files.
For example, some of the modules loaded using the `-profile lims` option may be outdated. To override module loading for a particular process in the `lims` profile so that Nextflow uses your own installation of a program, add the following `nextflow.config` file to your `.nextflow` folder (affects all of that user's runs) or to your project folder (affects that project only):
```groovy
profiles {
    lims {
        process {
            // Clear the module directive so the local installation is used instead
            withName: 'STRINGTIE.*' { module = '' }
        }
    }
}
```
This overrides the module loading for all processes whose names start with `STRINGTIE`, allowing Nextflow to use a local installation of `stringtie` that is available on `$PATH`.
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal, so that the workflow does not stop if you log out of your session. The logs are saved to a file.
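For example (a sketch reusing the command shown above):

```bash
nextflow run SpikyClip/llrnaseq \
    -profile lims \
    --input samplesheet.csv \
    --genome GRCh37 \
    -bg
```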
Alternatively, you can use `screen` / `tmux` or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
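For example, with `screen` (a sketch; the session name is arbitrary):

```bash
# Start a named session and launch the pipeline inside it
screen -S rnaseq
nextflow run SpikyClip/llrnaseq -profile lims --input samplesheet.csv --genome GRCh37
# Detach with Ctrl-a d, then re-attach later with:
screen -r rnaseq
```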