Cluster Submission Script (BBC): SGE Single Node

This tutorial provides an example of the header and overall structure of scripts intended to run jobs on the BBC cluster.  The BBC cluster runs the Sun Grid Engine (SGE) scheduler, which requires specific syntax to request resources.  For more information on the BBC cluster's queues and resources, please review this guide.

Data should be stored on the archival cloud-based storage when it is not actively being analyzed.  This space is backed up and provides the most storage capacity.  However, you should NOT run any analysis from this location, as it is not designed for frequent read/write activity.  This space is available at the path /archive.  There, you are free to create folders for your files and will have access to any folders you create.  We recommend creating one named after your username and organizing projects as sub-folders.
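
For example, a first-time setup might look like the following (the project folder name my_project is a hypothetical placeholder):

# One-time setup on the archive: a personal folder named after your username,
# with one sub-folder per project.
mkdir -p /archive/$USER/my_project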

To begin an analysis on data that currently lives in /archive, you will copy the data to a working directory known as /tempdata3.  This area can hold your scripts, temporary files, and output files until you are done with the analysis.  The workflow has three basic steps (a rough shell sketch follows the list):

  1. Copy data to /tempdata3
  2. Run your script for the data analysis
  3. Once the analysis is complete, copy results to /archive
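
A rough shell sketch of those three steps, as they would appear inside a single submission script (the paths and the analysis command are hypothetical placeholders; the rest of this tutorial builds the real script):

cp /archive/my_folder/my_project/*.fastq /tempdata3/my_folder/my_project/     # 1. stage the data
some_analysis_command /tempdata3/my_folder/my_project/*.fastq                 # 2. run the analysis
cp -a /tempdata3/my_folder/my_project/results /archive/my_folder/my_project/  # 3. archive the results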

A single script that completes all of these steps has the four parts described below.  We will work with a hypothetical dataset in which the user adm123 has sequenced the Tyrannosaurus rex genome and stored the raw files at /archive/dinosaurs/trex.  Within this directory, two FASTQ text files exist:

trex_R1.fastq
trex_R2.fastq

We would like to map these two files (raw reads in FASTQ format) against the trex genome using the bowtie2 short-read aligner.  This package is already installed on the BBC cluster, and we know how to call the application inside our script because we checked the software page.  We will create a script named trex.sh with a command-line text editor installed on the system (nano is the easiest, followed by vim or emacs).  The script has four parts:

  1. Header: Describing computational resources and other parameters
  2. Transferring data for the run
  3. Executing the command
  4. Transferring results to the archive

The final script includes all four sections. A copy of the final script is kept on the BBC server and is accessible at this location: /common/Tutorial/Template_Scripts/bowtie2_trex.sh
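
One convenient way to start is to copy that template into your current directory and then edit the copy with one of the editors mentioned above, for example:

cp /common/Tutorial/Template_Scripts/bowtie2_trex.sh trex.sh
nano trex.sh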

Section 1: Header

Below is the information that will be part of the header.

In general, you just need to change the job name (-N), your e-mail address (-M), the preferred queue (-q), the number of cores needed (-pe), and possibly the prefix for your error (-e) and output (-o) files.

#!/bin/bash
#$ -N bowtie2_trex
#$ -M first.last@uconn.edu
#$ -q molfind.q
#$ -m ea
#$ -S /bin/bash
#$ -cwd
#$ -pe smp 4
#$ -o bt2_$JOB_ID.out
#$ -e bt2_$JOB_ID.err

Detailed explanation of each SGE variable:

#!/bin/bash 

Specifies that the bash shell should be used to interpret the script.

#$ -N bowtie2_trex 

Names the job, making it easy to identify this particular job in the queue.

#$ -M first.last@uconn.edu 

The e-mail address to which status notifications about the job are sent.

#$ -m ea

Specifies when the user is notified by e-mail: when the job begins (b), ends (e), aborts (a), or is suspended (s).  Here, ea means an e-mail is sent when the job ends or aborts.
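
Other combinations of these letters are possible; for example:

#$ -m bea    # e-mail at the beginning, end, and abort of the job
#$ -m n      # no e-mail notifications at all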

#$ -q molfind.q

Specifies the queue.  Here it is molfind.q; other queues are lowpri.q, highpri.q, highmem.q, and all.q.  See this guide for queue information.
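
Before picking a queue, you can ask SGE for a summary of each queue's current load and free slots (the exact columns vary slightly between Grid Engine versions):

qstat -g c    # cluster queue summary: load and used/reserved/available slots per queue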

#$ -S /bin/bash

Specifies the shell that interprets the job (here, bash). This is helpful if you have environment variables set up in your .bash_profile.

#$ -cwd

Execute the job from the current working directory (cwd).

#$ -pe smp 4

Requests 4 cores for this script. On BBC, this can generally go up to 8 on the standard queue and 16 on highmem.q or highpri.q.
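
Grid Engine also exports the granted slot count to the job in the $NSLOTS environment variable, so an optional refinement (not used in the script below) is to pass that to the application instead of hard-coding the thread count:

# $NSLOTS is set by SGE to the number of slots granted by -pe smp
bowtie2 -p $NSLOTS -x trex -1 trex_R1.fastq -2 trex_R2.fastq -S trex.sam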

#$ -o bt2_$JOB_ID.out
#$ -e bt2_$JOB_ID.err

Renames the output log file (bt2_$JOB_ID.out) and error file (bt2_$JOB_ID.err).  $JOB_ID refers to the job ID number that is assigned when the script is submitted using qsub.

After composing our script, we submit trex.sh to the cluster using qsub as shown below, and the cluster outputs a line confirming the submission along with the job ID number.

qsub trex.sh 
Your job 239590 ("bowtie2_trex") has been submitted

In this example, $JOB_ID is 239590, so the log and error files (written to the directory from which the job was submitted, because of -cwd) will be bt2_239590.out and bt2_239590.err.
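
After submission, the job can be monitored or cancelled with standard SGE commands, for example:

qstat -u $USER   # list your jobs; state qw = waiting in queue, r = running
qdel 239590      # cancel the job using its job ID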

Section 2: Transferring data from /archive to /tempdata3:

PROJECT=trex
DATA_DIR="/tempdata3/$USER/$PROJECT/"$PROJECT"_DATA_"$JOB_ID
RESULT_DIR="/tempdata3/$USER/$PROJECT/"$PROJECT"_RESULTS_"$JOB_ID
mkdir -p $DATA_DIR $RESULT_DIR
cp /archive/dinosaurs/trex/trex_R1.fastq $DATA_DIR
cp /archive/dinosaurs/trex/trex_R2.fastq $DATA_DIR

The set of commands above creates a unique directory for each of your jobs so the output of each analysis is stored in a separate directory.

PROJECT=trex

Sets the $PROJECT variable to trex.  Anywhere $PROJECT appears in the script, it will be replaced by trex.

DATA_DIR="/tempdata3/$USER/$PROJECT/"$PROJECT"_DATA_"$JOB_ID 
RESULT_DIR="/tempdata3/$USER/$PROJECT/"$PROJECT"_RESULTS_"$JOB_ID
 

Sets the $DATA_DIR and $RESULT_DIR variables to /tempdata3/adm123/trex/trex_DATA_239590 and /tempdata3/adm123/trex/trex_RESULTS_239590, respectively ($JOB_ID is 239590 for this run, as described in the header section).
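
If you want to double-check how these variables expand, you can temporarily add echo statements to the script (purely illustrative; not part of the final script):

echo "Data directory:    $DATA_DIR"
echo "Results directory: $RESULT_DIR"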

mkdir -p $DATA_DIR $RESULT_DIR

The above command creates both directories (the -p flag also creates any missing parent directories).

cp /archive/dinosaurs/trex/trex_R1.fastq $DATA_DIR
cp /archive/dinosaurs/trex/trex_R2.fastq $DATA_DIR

These commands copy the files trex_R1.fastq and trex_R2.fastq from /archive/dinosaurs/trex/ to /tempdata3/adm123/trex/trex_DATA_239590.

Now the data files are present in /tempdata3/adm123/trex/trex_DATA_239590, where the compute nodes can access them efficiently for the analysis run.

Section 3: Executing the bowtie2 command:

cd $DATA_DIR
module load bowtie2/2.2.9
bowtie2 -p 4 -x trex -1 trex_R1.fastq -2 trex_R2.fastq -S trex.sam
samtools view -bhS trex.sam -o trex.bam

Detailed explanation:

cd $DATA_DIR

Change directory to /tempdata3/adm123/trex/trex_DATA_239590

module load bowtie2/2.2.9

Loads the module bowtie2/2.2.9. Each modulefile contains the information needed to configure the shell for an application. Find the list of available modules by running module avail at the command prompt.
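
A few related module commands that are often useful:

module avail              # list all modules installed on the cluster
module avail bowtie2      # list the available versions of a specific package
module list               # show the modules currently loaded in your session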

bowtie2 -p 4 -x trex -1 trex_R1.fastq -2 trex_R2.fastq -S trex.sam

This line calls the bowtie2 application: -p sets the number of alignment threads (matching the 4 cores requested in the header), -x the basename of the bowtie2 index, -1 and -2 the paired FASTQ files, and -S the name of the SAM output file.
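
Note that -x trex assumes a bowtie2 index with the basename trex (files such as trex.1.bt2) is already present in the working directory. If you only have the reference sequence as FASTA, the index is built once with bowtie2-build; a sketch, assuming a hypothetical reference file named trex_genome.fa:

# Builds trex.1.bt2 ... trex.4.bt2 plus trex.rev.1.bt2 and trex.rev.2.bt2
bowtie2-build trex_genome.fa trex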

samtools view -bhS trex.sam -o trex.bam

Converts the alignment file from SAM format (.sam) to the compressed (smaller) binary format known as BAM (.bam): -b requests BAM output, -h includes the header, -S indicates that the input is SAM, and -o names the output file.  Read more about common bioinformatics file formats here.
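
Downstream tools typically also expect the BAM to be sorted by coordinate and indexed; assuming a samtools 1.x installation is available, that would look like:

samtools sort -o trex.sorted.bam trex.bam   # sort alignments by coordinate
samtools index trex.sorted.bam              # creates trex.sorted.bam.bai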

Section 4: Transferring final results to /archive:

cat $0 >> $RESULT_DIR/README_script_$JOB_ID.txt
cp -a $DATA_DIR $RESULT_DIR /archive/dinosaurs/trex/

Detailed explanation:

cat $0 >> $RESULT_DIR/README_script_$JOB_ID.txt

Appends a copy of the currently running script ($0) to a text file in the results directory, for your record keeping.

cp -a $DATA_DIR $RESULT_DIR /archive/dinosaurs/trex/

Copies the data and results directories from the working area (/tempdata3) back to archival storage (/archive).  The -a flag copies directories recursively and preserves file attributes.
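
Once the job has finished, it is worth confirming that the results actually landed in the archive before cleaning anything out of /tempdata3, for example:

ls -lh /archive/dinosaurs/trex/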

Final Script

#!/bin/bash
#########################################################################
## Section 1: Header                                                    #
#########################################################################
#$ -N bowtie2_trex
#$ -M first.last@uconn.edu
#$ -q molfind.q
#$ -m ea
#$ -S /bin/bash
#$ -cwd
#$ -pe smp 4
#$ -o bt2_$JOB_ID.out
#$ -e bt2_$JOB_ID.err
#########################################################################
## Section 2: Transferring Data from /archive to /tempdata3             #
#########################################################################
PROJECT=trex
DATA_DIR="/tempdata3/$USER/$PROJECT/"$PROJECT"_DATA_"$JOB_ID
RESULT_DIR="/tempdata3/$USER/$PROJECT/"$PROJECT"_RESULTS_"$JOB_ID
mkdir -p $DATA_DIR $RESULT_DIR
cp /archive/dinosaurs/trex/trex_R1.fastq $DATA_DIR
cp /archive/dinosaurs/trex/trex_R2.fastq $DATA_DIR
#########################################################################
## Section 3: Running the executable                                    #
#########################################################################
cd $DATA_DIR
module load bowtie2/2.2.9
bowtie2 -p 4 -x trex -1 trex_R1.fastq -2 trex_R2.fastq -S trex.sam
samtools view -bhS trex.sam -o trex.bam
#########################################################################
## Section 4: Copy the results back to /archive                         #
#########################################################################
cat $0 >> $RESULT_DIR/README_script_$JOB_ID.txt
cp -a $DATA_DIR $RESULT_DIR /archive/dinosaurs/trex/