Job Submission Script (SGE): For an Array Job

This tutorial will guide you through composing an array job, which submits multiple input files to the same program to run on the BBC cluster.

We recommend that data be stored in the /archive storage space at all times. Once you have decided to work on a set of files for analysis, we recommend the following workflow.

In this workflow, your input fasta files are initially stored in your current working directory and are then moved to the /tempdata3 directory for the analysis. Once the analysis is done, the data is transferred to /archive for storage (until further analysis is needed on the data set).

  1. Copy data to /tempdata3
  2. Run your script for the data analysis
  3. Once the analysis is completed move results to the /archive directory
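In shell terms, the three steps look roughly like the sketch below. Temporary directories stand in for your home directory, /tempdata3, and /archive, and `wc -l` is a placeholder for the real analysis program, so the sketch can run anywhere:

```shell
HOME_AREA=$(mktemp -d)      # stands in for your home directory
WORK_AREA=$(mktemp -d)      # stands in for /tempdata3 scratch space
ARCHIVE_AREA=$(mktemp -d)   # stands in for /archive storage

printf '>seq1\nACGT\n' > "$HOME_AREA/test1.fasta"   # a toy input file

# 1. Copy data to the scratch area
cp "$HOME_AREA/test1.fasta" "$WORK_AREA/"

# 2. Run the analysis ('wc -l' is a placeholder for the real program)
wc -l < "$WORK_AREA/test1.fasta" > "$WORK_AREA/results.txt"

# 3. Move results to the archive and clear the scratch area
cp "$WORK_AREA"/* "$ARCHIVE_AREA"/
rm -rf "$WORK_AREA"
```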

 

Keeping with the above workflow, the script will be divided into 4 main sections:

  1. Describing computational resources and other parameters
  2. Transferring the data for the run
  3. Executing the job
  4. Moving the data to the archive for storage

The final script will have all four sections included. A copy of the final script is kept on the BBC server; you can access it at the following path: /common/Tutorial/Template_Scripts/template_Array_Job.sh

Section 1: Setting up the parameters and other resources

In this section, you will give your job a name for identification, ask the server to send status reports to your email address, and set how many independent tasks may run at a given time.

#$ -N blastp_array_job
#$ -M first.last@uconn.edu
#$ -m besa  
#$ -t 1-4:1
#$ -tc 2
#$ -pe smp 1
#$ -S /bin/bash
#$ -o $JOB_NAME.$TASK_ID.out
#$ -e $JOB_NAME.$TASK_ID.err

 

Explanation

#$ -N blastp_array_job
Specify the job name to be used to identify this run, "blastp_array_job"
#$ -M first.last@uconn.edu
Email address (change to yours), so you can receive email confirmations on the status of your current job
#$ -m besa
With the above flags you can specify what kind of status reports you would like to receive. The mailing options are: b=beginning, e=end, s=suspended, n=never, a=aborted or suspended
#$ -t 1-4:1
This sets the task range of the array job from 1 to 4 with a step size of 1, so a total of 4 tasks will be submitted
#$ -tc 2
This sets the maximum number of concurrent tasks to 2, so that no more than 2 tasks run simultaneously at any given time.
#$ -pe smp 1
Number of processors requested for each task
#$ -S /bin/bash
Specify that bash shell should be used to process this script
#$ -o $JOB_NAME.$TASK_ID.out
#$ -e $JOB_NAME.$TASK_ID.err
Specify the output file and the error file using the pseudo environment variables $JOB_NAME and $TASK_ID
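The file names these two directives produce can be previewed outside the scheduler. SGE performs the substitution itself at run time; the variables below are set by hand purely to illustrate the result (a dot separator keeps the job name and task ID visually distinct):

```shell
# SGE substitutes $JOB_NAME and $TASK_ID itself at run time; they are
# set by hand here only to preview the file names a given task produces
JOB_NAME=blastp_array_job
TASK_ID=3

OUT_FILE="$JOB_NAME.$TASK_ID.out"   # stdout of task 3
ERR_FILE="$JOB_NAME.$TASK_ID.err"   # stderr of task 3
echo "$OUT_FILE $ERR_FILE"
```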

 

Section 2: Setting up the directories and transferring the data files

In this section, we will create a working folder and a destination folder. It is not necessary to create these folders beforehand: the script checks whether each directory exists and, if not, creates it. It then copies the input file to the working directory.

INPUTFILES=(test1.fasta test2.fasta test3.fasta test4.fasta)
PROJECT_SUBDIR="MyArrayJob"

INPUTFILENAME="${INPUTFILES[$SGE_TASK_ID - 1]}"
WORKING_DIR="/tempdata3/$USER/$PROJECT_SUBDIR-$SGE_TASK_ID"

if [ ! -d "$WORKING_DIR" ]; then
    mkdir -p "$WORKING_DIR"
fi

DESTINATION_DIR="/archive/$USER/$PROJECT_SUBDIR/$PROJECT_SUBDIR-$SGE_TASK_ID"

if [ ! -d "$DESTINATION_DIR" ]; then
    mkdir -p "$DESTINATION_DIR"
fi

cp $INPUTFILENAME $WORKING_DIR/

 

Explanation of the above commands:

INPUTFILES=(test1.fasta test2.fasta test3.fasta test4.fasta)
First we create an array to hold the fasta input files, which are in the home directory
PROJECT_SUBDIR="MyArrayJob"
Names the project subdirectory; it is used to build the working and destination directory paths
INPUTFILENAME="${INPUTFILES[$SGE_TASK_ID - 1]}"
Pull the data file name from the list defined above according to the task id ($SGE_TASK_ID)
WORKING_DIR="/tempdata3/$USER/$PROJECT_SUBDIR-$SGE_TASK_ID"
Specify the working directory, which will be located under /tempdata3. This is where the calculations/analysis will be carried out.
if [ ! -d "$WORKING_DIR" ]; then
    mkdir -p "$WORKING_DIR"
fi
This looks for the working directory in the specified location ($WORKING_DIR). If the working directory is not found, it is created for you.
The -p means "create parent directories as needed"
DESTINATION_DIR="/archive/$USER/$PROJECT_SUBDIR/$PROJECT_SUBDIR-$SGE_TASK_ID"
Specify the destination directory. This is where you will store the data after the run. (It will be a subdirectory of your user directory in /archive)
if [ ! -d "$DESTINATION_DIR" ]; then
    mkdir -p "$DESTINATION_DIR"
fi
If destination directory does not exist, create it
The -p in mkdir means "create parent directories as needed"
cp $INPUTFILENAME $WORKING_DIR/
Copy the input file to the newly created working directory in /tempdata3
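The file-selection logic above can be tried outside the scheduler by setting SGE_TASK_ID by hand; inside a real array job, SGE sets it for you:

```shell
INPUTFILES=(test1.fasta test2.fasta test3.fasta test4.fasta)

SGE_TASK_ID=3   # set by SGE inside a real array job; set by hand here

# Bash arrays are zero-based while task IDs start at 1, hence the "- 1"
INPUTFILENAME="${INPUTFILES[$SGE_TASK_ID - 1]}"
echo "$INPUTFILENAME"   # task 3 selects test3.fasta
```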

 

Section 3: Executing the program of interest

In this section you will execute the program. If it is a multi-processor job, include the number of processors you are requesting, and make sure it matches the number in the header section: #$ -pe smp n (n = number of processors). Each task changes to its newly created working directory and runs the command on the file name it drew from INPUTFILES in Section 2.

cd $WORKING_DIR

blastp -query $INPUTFILENAME -db /usr/local/blast/data/refseq_protein -num_alignments 5 -num_descriptions 5 -out my-results

 

Explanation of the above commands:

cd $WORKING_DIR
Navigate to the working directory
blastp -query $INPUTFILENAME -db /usr/local/blast/data/refseq_protein -num_alignments 5 -num_descriptions 5 -out my-results
Run the program, in this case "blastp"
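A refinement you may want (it is not part of the original script): test the program's exit status so that a failed task does not go on to archive partial results. The sketch below uses a placeholder function instead of the real blastp call:

```shell
# Placeholder for the real blastp command line; replace the body with it
run_analysis() {
    true    # 'true' always succeeds, standing in for a successful run
}

if run_analysis; then
    STATUS=ok        # safe to copy results to /archive in Section 4
else
    STATUS=failed    # keep the working directory for debugging
fi
echo "$STATUS"
```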

 

Section 4: Storing the data and cleaning the working directory

Once the run is complete, you need to move the data out of the working directory. The /archive directory is a place where you can keep your data for safekeeping. After copying the data to the archive, the data in the working directory needs to be deleted. (More information on the storage facilities of the BBC server can be found on our web page "Understanding the Bioinformatics Cluster".)

cp * $DESTINATION_DIR
rm -rf $WORKING_DIR

Explanation of the above commands:

cp * $DESTINATION_DIR
Since you're currently in the working directory, this copies all your files to the destination directory.
The * is a wildcard character, which indicates "all" files.
rm -rf $WORKING_DIR
Once you have copied all the files, this remove command deletes the working directory along with all the files and subfolders in it.
-r  --recursive Remove directories and their contents recursively
-f  --force     Ignore nonexistent files, never prompt
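Because rm -rf is irreversible, a slightly more defensive variant (an optional pattern, not part of the original script) deletes the working directory only if the copy succeeded. Temporary directories stand in for /tempdata3 and /archive so the sketch can run anywhere:

```shell
WORKING_DIR=$(mktemp -d)        # stands in for the /tempdata3 work area
DESTINATION_DIR=$(mktemp -d)    # stands in for the /archive destination
echo results > "$WORKING_DIR/out.txt"   # a toy result file

cd "$WORKING_DIR"
# '&&' runs the cleanup only when cp reported success; cd out of the
# directory first so rm -rf can remove it safely
cp ./* "$DESTINATION_DIR"/ && { cd /; rm -rf "$WORKING_DIR"; }
```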

 

Final Script

#!/bin/bash

#########################################################################
## Section 1: Header                                                    #
#########################################################################

# Specify name to be used to identify this run
#$ -N template_Array_Job

# Email address (change to yours)
#$ -M first.last@uconn.edu

# Specify mailing options: b=beginning, e=end, s=suspended, n=never, a=aborted or suspended
#$ -m besa

# This sets the task range in the array from 1 to 4 with a step size of 1
#$ -t 1-4:1

# This sets the maximum number of concurrent tasks to 2, so that no more than 2 jobs will be run at once
#$ -tc 2

# Specify the number of processors for the single job
#$ -pe smp 1

# Specify the queue 
#$ -q molfind.q

# Change directory to the current 
#$ -cwd

# Specify that bash shell should be used to process this script
#$ -S /bin/bash

# Specify the output file
#$ -o $JOB_NAME.$TASK_ID.out

# Specify the error file
#$ -e $JOB_NAME.$TASK_ID.err



#########################################################################
## Section 2: Transferring Data from current directory to /tempdata3    #
#########################################################################
# Specify the filenames
INPUTFILES=(test1.fasta test2.fasta test3.fasta test4.fasta)

# Name the Project directory
PROJECT_SUBDIR="MyArrayJob"

# Pull data file name from list defined above according to task id
INPUTFILENAME="${INPUTFILES[$SGE_TASK_ID - 1]}"

# Specify working directory under the /tempdata3
WORKING_DIR="/tempdata3/$USER/$PROJECT_SUBDIR-$SGE_TASK_ID"

# If working directory does not exist, create it
# The -p means "create parent directories as needed"
if [ ! -d "$WORKING_DIR" ]; then
    mkdir -p "$WORKING_DIR"
fi

# Specify destination directory (this will be subdirectory of your user directory in the archive)
DESTINATION_DIR="/archive/$USER/$PROJECT_SUBDIR/$PROJECT_SUBDIR-$SGE_TASK_ID"

# If destination directory does not exist, create it
# The -p in mkdir means "create parent directories as needed"
if [ ! -d "$DESTINATION_DIR" ]; then
    mkdir -p "$DESTINATION_DIR"
fi

# Copy the input data to the working directory
cp $INPUTFILENAME $WORKING_DIR/



#########################################################################
## Section 3: Executing the program                                     #
#########################################################################

# navigate to the working directory
cd $WORKING_DIR

# Run the program
blastp -query $INPUTFILENAME -db /usr/local/blast/data/refseq_protein -num_alignments 5 -num_descriptions 5 -out my-results



#########################################################################
## Section 4: Copy the result to /archive                               #
#########################################################################

# Copy the results to the destination directory
cp * $DESTINATION_DIR

# clear the working directory
rm -rf $WORKING_DIR