
metaGEM parser

Francisco Zorrilla edited this page Mar 29, 2021 · 8 revisions

Overview

This page explains the inner workings of the metaGEM.sh parser, which is designed to simplify metaGEM's user interface. The procedures described below are carried out automatically by metaGEM.sh, which orchestrates the submission of jobs on the cluster.

Most importantly, the parser:

  1. Configures the Snakefile to execute the user-defined rule
  2. Configures the config.yaml file to set the root path
  3. Configures the cluster_config.json file based on user input
  4. Submits jobs

Each of these operations is discussed in further detail below.

Please see the tutorial for a demonstration of how to use the metaGEM.sh parser.

[Figure: sup_fig1]

1. Configuring the Snakefile

The metaGEM.sh parser rewrites the input of rule all at the top of the Snakefile so that it expands the wildcards in the output of the desired rule. This is necessary because target rules in Snakemake cannot contain wildcards. In short, the metaGEM.sh parser stores a string for each recognized task, such as fastp, megahit, concoct, etc. The parser reads the task given with the --task|-t flag and compares it against a list of recognized tasks. If the user input matches a recognized task, a task-specific string is defined as shown below:

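# Excerpt from the chain of conditions in metaGEM.sh that matches the --task value
elif [ $task == "fastp" ]; then
  # Target string for rule all: the fastp output files, expanded over all sample IDs
  string='expand(config["path"]["root"]+"/"+config["folder"]["qfiltered"]+"/{IDs}/{IDs}_1.fastq.gz", IDs = IDs)'
  if [ $local == "true" ]; then
      submitLocal
  else
      submitCluster
  fi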
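The lookup described above can also be sketched as a self-contained function. This is only an illustration, not how metaGEM.sh is written: the function name target_for_task is hypothetical, and only the fastp string is taken from the excerpt above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of metaGEM.sh): map a recognized task name
# to the target string that rule all should request.
target_for_task() {
  case "$1" in
    fastp)
      # String copied from the fastp branch shown above.
      printf '%s\n' 'expand(config["path"]["root"]+"/"+config["folder"]["qfiltered"]+"/{IDs}/{IDs}_1.fastq.gz", IDs = IDs)'
      ;;
    *)
      # Any task without a stored string is rejected.
      echo "Unrecognized task: $1" >&2
      return 1
      ;;
  esac
}

# Example: look up the string for the fastp task.
string=$(target_for_task fastp)
echo "$string"
```

Quoting "$1" keeps the comparison safe when no task is given, which the unquoted test in the excerpt would not tolerate.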

The submitLocal or submitCluster function then uses this task-specific string to overwrite the input of rule all on line 22 of the Snakefile:

# Parse Snakefile rule all (line 22 of Snakefile) input to match output of desired target rule stored in "$string". Note: Hardcoded line number.
echo "Parsing Snakefile to target rule: $task ... "
sed  -i "22s~^.*$~        $string~" Snakefile

Please refer to the Snakefile wiki page for more information.
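To see what the line-addressed sed substitution does, the sketch below applies the same command to a throwaway three-line file instead of the real Snakefile. The demo file contents and the line number 3 are assumptions made for illustration, and GNU sed is assumed for the -i flag.

```shell
#!/usr/bin/env bash
# Assumption: GNU sed (in-place editing with -i and no backup suffix).
# Build a small stand-in for the Snakefile; the real file is not used here.
demo=$(mktemp)
printf '%s\n' 'rule all:' '    input:' '        "placeholder"' > "$demo"

string='expand(config["path"]["root"]+"/"+config["folder"]["qfiltered"]+"/{IDs}/{IDs}_1.fastq.gz", IDs = IDs)'
line=3   # metaGEM.sh hardcodes 22; the demo file only has 3 lines

# "~" serves as the sed delimiter, so the slashes inside $string need no escaping.
# The whole addressed line is replaced by eight spaces of indentation plus $string.
sed -i "${line}s~^.*\$~        ${string}~" "$demo"

# Read the rewritten line back and clean up.
result=$(sed -n "${line}p" "$demo")
echo "$result"
rm -f "$demo"
```

Because the substitution addresses a fixed line number, inserting or deleting lines above rule all in the Snakefile means the hardcoded 22 has to be updated as well.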

2. Configuring the config.yaml file

For metaGEM to locate your files on the cluster, the root directory in the config.yaml file must be set to the current directory, i.e. the location you are running metaGEM from. This is done automatically every time you run the metaGEM.sh parser, using the following code:

# Set root folder
echo "Setting current directory to root ... "
root=$(pwd)
sed  -i "2s~/.*$~$root~" config.yaml # hardcoded line for root, change the number 2 if any new lines are added to the start of config.yaml

Please refer to the config.yaml wiki page for more information.
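Because the root substitution depends on a hardcoded line number, one possible alternative is to match the root key itself. The sketch below is not what metaGEM.sh does. It assumes GNU sed and a config.yaml in which the entry looks like "root: /some/path", and it runs against a temporary demo file.

```shell
#!/usr/bin/env bash
# Assumptions: GNU sed, and a root entry of the form "    root: /some/path".
# The demo file below is a stand-in, not the real config.yaml.
cfg=$(mktemp)
printf '%s\n' '# demo config' 'path:' '    root: /old/location' '    scratch: /tmp' > "$cfg"

root=$(pwd)

# Match the "root:" key wherever it sits instead of addressing line 2.
# "~" is the delimiter because $root contains slashes.
# Caveat: a path containing "~" or "&" would need escaping first.
sed -i "s~^\([[:space:]]*root:\).*\$~\1 ${root}~" "$cfg"

# Read back the rewritten entry and the neighbouring line, then clean up.
updated=$(grep 'root:' "$cfg")
untouched=$(sed -n '4p' "$cfg")
echo "$updated"
rm -f "$cfg"
```

With this pattern, adding lines to the start of the file (such as the comment in the demo) does not require changing the command.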

3. Configuring the cluster_config.json file

Next, the parser prepares the cluster_config.json file for job submission by setting the resources requested through the user inputs --cores|-c, --mem|-m, and --hours|-h. Please refer to the cluster_config.json wiki page for more information.
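This page does not show the code for this step, so the following is only a sketch of how such an update could be done with sed. The JSON layout, the key names ("n", "mem", "time"), and the value formats are all assumptions made for illustration; the real cluster_config.json may differ. GNU sed is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical cluster_config.json layout; keys and formats are assumptions.
cc=$(mktemp)
cat > "$cc" <<'EOF'
{
    "__default__": {
        "time": "0-1:00:00",
        "n": 1,
        "mem": "4G"
    }
}
EOF

# Stand-ins for the values passed via the cores, memory, and hours flags.
cores=24
mem=120
hours=12

# Rewrite each value by matching its key rather than a line number.
sed -i \
  -e "s~\"n\": [0-9]*~\"n\": ${cores}~" \
  -e "s~\"mem\": \"[0-9]*G\"~\"mem\": \"${mem}G\"~" \
  -e "s~\"time\": \"0-[0-9]*:~\"time\": \"0-${hours}:~" \
  "$cc"

# Capture the edited file for inspection, then clean up.
result=$(cat "$cc")
echo "$result"
rm -f "$cc"
```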

4. Submitting jobs

After configuring the files described above, the metaGEM.sh parser displays the modified config.yaml and cluster_config.json files for user verification. Once the user confirms that the files are properly configured, metaGEM.sh performs a dry run of the jobs, which checks that everything is set up correctly and that rule dependencies can be resolved. If the wildcards or the Snakefile are not properly configured, an error message will generally appear at this stage. Once the user verifies that the dry run expanded the wildcards as expected, the jobs are submitted to the cluster workload manager.
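The display, confirm, dry run, confirm sequence can be sketched as follows. The function names confirm and submit_flow are hypothetical and do not come from metaGEM.sh. The dry run uses Snakemake's -n flag, and the actual submission command is left out because it is not shown on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when the user answers y or Y.
confirm() {
  local reply
  printf '%s [y/N] ' "$1"
  read -r reply
  [ "$reply" = "y" ] || [ "$reply" = "Y" ]
}

# Sketch of the flow described above; nothing runs until submit_flow is called.
submit_flow() {
  # 1. Show the configured files for inspection.
  cat config.yaml cluster_config.json

  # 2. Stop unless the user approves the configuration.
  confirm "Do the configuration files look correct?" || return 1

  # 3. Dry run: Snakemake resolves rule dependencies without executing jobs.
  snakemake all -n || return 1

  # 4. Stop unless the user approves the dry run output.
  confirm "Did the dry run expand the wildcards as expected?" || return 1

  # 5. The real submission command would go here (omitted in this sketch).
  echo "Submitting jobs ..."
}
```

Returning early at each step means a rejected prompt or a failed dry run never reaches the submission stage.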

Usage

You may also refer to the metaGEM.sh help message for information regarding flags and available tasks.

_________________________________________________________________________/\\\\\\\\\\\\___/\\\\\\\\\\\\\\\___/\\\\____________/\\\\_        
 _______________________________________________________________________/\\\//////////___\/\\\///////////___\/\\\\\\________/\\\\\\_       
  __________________________________________/\\\________________________/\\\______________\/\\\______________\/\\\//\\\____/\\\//\\\_      
   ____/\\\\\__/\\\\\________/\\\\\\\\____/\\\\\\\\\\\___/\\\\\\\\\_____\/\\\____/\\\\\\\__\/\\\\\\\\\\\______\/\\\\///\\\/\\\/_\/\\\_     
    __/\\\///\\\\\///\\\____/\\\/////\\\__\////\\\////___\////////\\\____\/\\\___\/////\\\__\/\\\///////_______\/\\\__\///\\\/___\/\\\_    
     _\/\\\_\//\\\__\/\\\___/\\\\\\\\\\\______\/\\\_________/\\\\\\\\\\___\/\\\_______\/\\\__\/\\\______________\/\\\____\///_____\/\\\_   
      _\/\\\__\/\\\__\/\\\__\//\\///////_______\/\\\_/\\____/\\\/////\\\___\/\\\_______\/\\\__\/\\\______________\/\\\_____________\/\\\_  
       _\/\\\__\/\\\__\/\\\___\//\\\\\\\\\\_____\//\\\\\____\//\\\\\\\\/\\__\//\\\\\\\\\\\\/___\/\\\\\\\\\\\\\\\__\/\\\_____________\/\\\_ 
        _\///___\///___\///_____\//////////_______\/////______\////////\//____\////////////_____\///////////////___\///______________\///__
        
        
Usage: bash metaGEM.sh [-t|--task TASK] 
                       [-j|--nJobs NUMBER OF JOBS] 
                       [-c|--cores NUMBER OF CORES] 
                       [-m|--mem GB RAM] 
                       [-h|--hours MAX RUNTIME]
                       [-l|--local]

Snakefile wrapper/parser for metaGEM. 

Options:
  -t, --task        Specify task to complete:

                        SETUP
                            createFolders
                            downloadToy
                            organizeData

                        WORKFLOW
                            fastp 
                            megahit 
                            crossMap 
                            concoct 
                            metabat
                            maxbin 
                            binRefine 
                            binReassemble 
                            extractProteinBins
                            carveme
                            memote
                            organizeGEMs
                            smetana
                            extractDnaBins
                            gtdbtk
                            abundance 
                            grid
                            prokka
                            roary

                        VISUALIZATION (in development)
                            qfilterVis
                            assemblyVis
                            binningVis
                            taxonomyVis
                            modelVis
                            interactionVis
                            growthVis

  -j, --nJobs       Specify number of jobs to run in parallel
  -c, --nCores      Specify number of cores per job
  -m, --mem         Specify memory in GB required for job
  -h, --hours       Specify number of hours to allocated to job runtime
  -l, --local       Run jobs on local machine for non-cluster usage