- What is a cluster
- How to obtain an account
- How to reset a password
- How to access the Xanadu cluster
- HPC resources and limits
- Working with Slurm
- File system
- How to Transfer Data Between Clusters
- Running a job interactively
A desktop or a laptop is, in most cases, inadequate for the analysis of large-scale datasets (e.g. genomics) or for running simulations (e.g. protein docking): it lacks both the processing power and the memory to execute these analyses. This limitation can be overcome by combining machines (computers) in a predefined architecture/configuration so that they act as a single unit with enhanced computational power and shared resources. This is the basic concept of a high-performance cluster. A cluster consists of a set of connected computers that work together so that they can be viewed as a single system. Each computer unit in a cluster is referred to as a 'node'.
The components of a cluster are usually connected through fast local area networks ("LAN"), with each node running its own instance of an operating system. The benefits of clusters include low cost, elasticity, and the ability to run jobs anytime, anywhere.
- Never run anything on the head node of a cluster (where you first login).
- Keep track of what you are running and refrain from using all nodes at the same time.
- Run and write your output files in your home directory or designated locations. Do not run jobs on the archival storage (/linuxshare) but do use this as a stable location to save your completed work.
To obtain an account on Xanadu, you must have a UCH account, also known as a CAM account. The following link will allow you to request one: http://bioinformatics.uconn.edu/contact-us/
Select "Account Request (Xanadu cluster)" from the list on the contact-us page.
Once you submit the request, you will receive a CAM form to fill out; upon approval, you will be able to access the cluster.
An interface exists to reset your password here: https://vdusers.cam.uchc.edu/pm/
Your CAM credentials will allow you to access this resource.
Xanadu can be accessed both from within UCH and from outside (on or off campus).
- For users accessing the cluster from within UCH or via the UCH VPN (accessible with your CAM credentials), SSH to xanadu-submit-int.cam.uchc.edu, as shown below.
- For users accessing Xanadu from outside (e.g. the Storrs campus or via the UConn Storrs VPN), SSH to xanadu-submit-ext.cam.uchc.edu, as shown below.
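For example, replacing <user_name> with your CAM user name (host names taken from the sections below):

$ ssh <user_name>@xanadu-submit-int.cam.uchc.edu   (from within UCH or via the UCH VPN)
$ ssh <user_name>@xanadu-submit-ext.cam.uchc.edu   (from outside UCH, e.g. Storrs)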
Connecting to the Cluster Using a Windows Computer (PuTTY)
Windows users will need to use an SSH client to connect to the cluster. Install PuTTY and configure it for use:
PuTTY configuration steps:
Open PuTTY; the main configuration window (Window 1) will open.
- Provide the host name, e.g. <user_name>@xanadu-submit-ext.cam.uchc.edu or <user_name>@xanadu-submit-int.cam.uchc.edu
- Expand the SSH tab and select X11 (shown in Window 2)
- Enable X11 forwarding by selecting it (Window 2)
- Scroll up the left panel and select Session (Window 1)
- Name your session, e.g. Xanadu_cluster, and click Save
- Your session name should appear under Saved Sessions
- Double-click your session name to connect to the server with an SSH session
Connecting to the Cluster with the Graphical Interface Enabled
In order to display graphics from the cluster, we need software that allows one to use Linux graphical applications remotely. Xming and XQuartz are the display options available for Windows and Mac respectively. (Download and install the appropriate one on your local computer.)
NOTE: Start the X server on your machine (Xming/XQuartz) each time you reboot your PC or whenever you want to use X Windows. Once enabled, Xming will appear in your system tray as a black X with an orange circle around the middle.
To log in to the head node of the cluster, run one of the following commands in a Mac or Linux terminal:
ssh -X <user_name>@xanadu-submit-int.cam.uchc.edu
ssh -X <user_name>@xanadu-submit-ext.cam.uchc.edu
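To verify that X11 forwarding is working, you can launch a simple graphical application after logging in; assuming xclock is available on the cluster, a quick test is:

$ xclock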
Connecting to the Cluster from outside of UConn
When you are outside of UConn, you need to establish a VPN connection before connecting to the cluster.
Information on how to connect via Virtual Private Network (VPN) client can be found at the following link: http://remoteaccess.uconn.edu/vpn-overview/connect-via-vpn-client-2/
The server URL to connect to the UConn VPN requires a NetID login/password.
The server URL to connect to the UCHC VPN, vpn.uchc.edu/cam, requires the CAM login/password (see the Pulse Secure steps in the interactive-session section below).
The Xanadu cluster uses Slurm, a highly scalable cluster management and job scheduling system for large and small Linux clusters. The nodes within the cluster are divided into groups called partitions. Xanadu has several partitions available: general, xeon, amd, himem1, himem2, himem3, himem4.
To look up the available partition information, you can use 'sinfo -s', which will give you the current list:
$ sinfo -s
PARTITION AVAIL  TIMELIMIT  NODES(A/I/O/T)  NODELIST
general*  up     infinite   2/15/0/17       xanadu-[01-11,20-25]
xeon      up     infinite   0/11/0/11       xanadu-[01-11]
amd       up     infinite   2/4/0/6         xanadu-[20-25]
himem1    up     infinite   0/1/0/1         xanadu-30
himem2    up     infinite   0/1/0/1         xanadu-31
himem3    up     infinite   0/1/0/1         xanadu-32
himem4    up     infinite   0/1/0/1         xanadu-33
In the output above, general* is the default partition for users (the asterisk marks the default). NODES(A/I/O/T) is a count of nodes by state in the form "Allocated / Idle / Other / Total"; for example, 2/15/0/17 for general means 2 nodes allocated, 15 idle, 0 in another state, and 17 total.
The Xanadu cluster is divided into five partitions:
general partition: consists of 17 nodes, each with 32 to 36 CPUs in a given configuration and 128 or 256 GB of memory.
$ sinfo -N -l -p general
NODELIST   NODES PARTITION STATE     CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
xanadu-01  1     general*  idle      36   2:18:1 257676 15620    1      (null)   none
xanadu-02  1     general*  idle      36   2:18:1 257676 15620    1      (null)   none
xanadu-03  1     general*  idle      36   2:18:1 257676 15620    1      (null)   none
xanadu-04  1     general*  idle      36   2:18:1 257676 15620    1      (null)   none
xanadu-05  1     general*  idle      36   2:18:1 257676 15620    1      (null)   none
xanadu-06  1     general*  idle      36   2:18:1 257676 15620    1      (null)   none
xanadu-07  1     general*  idle      36   2:18:1 257676 15620    1      (null)   none
xanadu-08  1     general*  idle      36   2:18:1 257676 15620    1      (null)   none
xanadu-09  1     general*  idle      36   2:18:1 257676 15620    1      (null)   none
xanadu-10  1     general*  idle      36   2:18:1 128532 15620    1      (null)   none
xanadu-11  1     general*  idle      36   2:18:1 128532 15620    1      (null)   none
xanadu-20  1     general*  mixed     32   4:8:1  128745 15620    1      (null)   none
xanadu-21  1     general*  mixed     32   4:8:1  128745 15620    1      (null)   none
xanadu-22  1     general*  allocated 32   4:8:1  257760 15620    1      (null)   none
xanadu-23  1     general*  allocated 32   4:8:1  257760 15620    1      (null)   none
xanadu-24  1     general*  allocated 32   4:8:1  249696 15620    1      (null)   none
xanadu-25  1     general*  allocated 32   4:8:1  209380 15620    1      (null)   none
himem partitions: four separate high-memory partitions, each consisting of a single node with 32 cores and 512 GB of memory.
NODELIST   NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
xanadu-30  1     himem1    idle  32   4:8:1 515792 15620    1      (null)   none
xanadu-31  1     himem2    idle  32   4:8:1 515792 15620    1      (null)   none
xanadu-32  1     himem3    idle  32   4:8:1 515792 15620    1      (null)   none
xanadu-33  1     himem4    idle  32   4:8:1 515792 15620    1      (null)   none
The general partition can be further divided according to the processor type used in each node:
xeon partition (nodes with Xeon processors)
amd partition (nodes with AMD processors)
xeon partition: 11 nodes, each with 36 cores and 128 or 256 GB of memory (see the output below).
$ sinfo -N -l -p xeon
NODELIST   NODES PARTITION STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
xanadu-01  1     xeon      idle  36   2:18:1 257676 15620    1      (null)   none
xanadu-02  1     xeon      idle  36   2:18:1 257676 15620    1      (null)   none
xanadu-03  1     xeon      idle  36   2:18:1 257676 15620    1      (null)   none
xanadu-04  1     xeon      idle  36   2:18:1 257676 15620    1      (null)   none
xanadu-05  1     xeon      idle  36   2:18:1 257676 15620    1      (null)   none
xanadu-06  1     xeon      idle  36   2:18:1 257676 15620    1      (null)   none
xanadu-07  1     xeon      idle  36   2:18:1 257676 15620    1      (null)   none
xanadu-08  1     xeon      idle  36   2:18:1 257676 15620    1      (null)   none
xanadu-09  1     xeon      idle  36   2:18:1 257676 15620    1      (null)   none
xanadu-10  1     xeon      idle  36   2:18:1 128532 15620    1      (null)   none
xanadu-11  1     xeon      idle  36   2:18:1 128532 15620    1      (null)   none
amd partition: 6 nodes, each with 32 cores and 128 to 256 GB of memory.
$ sinfo -N -l -p amd
NODELIST   NODES PARTITION STATE     CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
xanadu-20  1     amd       mixed     32   4:8:1 128745 15620    1      (null)   none
xanadu-21  1     amd       mixed     32   4:8:1 128745 15620    1      (null)   none
xanadu-22  1     amd       allocated 32   4:8:1 257760 15620    1      (null)   none
xanadu-23  1     amd       allocated 32   4:8:1 257760 15620    1      (null)   none
xanadu-24  1     amd       allocated 32   4:8:1 249696 15620    1      (null)   none
xanadu-25  1     amd       allocated 32   4:8:1 209380 15620    1      (null)   none
Summary of the nodes in the Xanadu cluster:
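The following summary is compiled from the sinfo output above:

Partition  Nodes   Cores/node  Memory/node
general    17      32-36       128-256 GB
xeon       11      36          128-256 GB
amd        6       32          128-256 GB
himem1-4   1 each  32          512 GB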
Sample script for standard job submission.
#!/bin/bash
#SBATCH --job-name=myscript
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --partition=general
#SBATCH --mail-type=END
#SBATCH --mail-user=email@example.com
#SBATCH -o myscript.out
#SBATCH -e myscript.err

echo "Hello World"
A general script will consist of 3 main parts:
- The #!/bin/bash line, which allows the file to run as a bash script
- Parameters for the SLURM scheduler indicated by #SBATCH
- Command submission line(s), which come from your selected application
The #SBATCH lines indicate the set of parameters for the SLURM scheduler.
#SBATCH --job-name=myscript Sets the name of your job
#SBATCH -n 1 Requests the number of cores (tasks) for your job
#SBATCH -N 1 This line requests that the cores all be on one node. Only change this to >1 if you know your code uses a message-passing protocol like MPI. SLURM makes no assumptions on this parameter: if you request more than one core (-n > 1) and you forget this parameter, your job may be scheduled across nodes, and unless your job is MPI (multi-node) aware, it will run slowly, as it is oversubscribed on the master node while wasting resources on the other(s).
#SBATCH --partition=general This line specifies the SLURM partition (in this instance it will be the general partition) under which the script will be run
#SBATCH --mail-type=END Mailing options to indicate the state of the job. In this instance it will send a notification at the end
#SBATCH --mail-user=firstname.lastname@example.org Email address to which the notification should be sent
#SBATCH -o myscript.out Specifies the file to which the standard output will be appended
#SBATCH -e myscript.err Specifies the file to which standard error will be appended
*To submit a job to a particular partition, --partition must be specified in the script.
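Note that sbatch also accepts the same options on the command line, where they override the corresponding #SBATCH directives in the script. For example, to send the script above to a himem partition without editing it (the partition choice here is illustrative):

$ sbatch --partition=himem1 myscript.sh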
How to submit a job
Submitting a script to the cluster is done with the
sbatch command. Note: each user can submit a maximum of 50 jobs (50 independent jobs or scripts, or 50 tasks within an array job). All scripts are submitted to the cluster with the following command:
$ sbatch myscript.sh
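On success, sbatch prints the JobID assigned to your job; this number is used later to monitor or cancel the job. For example:

Submitted batch job 201185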
Monitoring a submitted job
To monitor all jobs, the squeue command can be used:
$ squeue
 JOBID PARTITION     NAME  USER ST       TIME NODES NODELIST(REASON)
246233   general  STO_001 USER1 CG       4:48     1 xanadu-21
301089     himem ProtMasW USER2 PD       0:00     1 (Priority)
301013       amd ProtMasN USER2  R    5:43:21     1 xanadu-24
301677   general mv_db.sh USER3  R      14:48     1 xanadu-22
297400     himem  bfo_111 USER4  R 1-07:16:26     4 xanadu-[30-33]
This gives information on the jobs in all partitions. One important aspect is the state (ST) of the job in the queue:
R – Running
PD – Pending
CG – Completing
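Other state codes you may see in the ST column include CD (Completed), CA (Cancelled), and F (Failed).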
To monitor a particular job
squeue can be used with the -j option. In this example, the JobID is 201185. This number is provided at the time of job submission and can be used to reference the job while it is running and after it has completed.
$ squeue -j 201185
 JOBID PARTITION     NAME      USER ST TIME NODES NODELIST(REASON)
201185   general myscript [USER_ID]  R 0:29     1 xanadu-20
To monitor jobs submitted by a user
$ squeue -u UserID
 JOBID PARTITION    NAME   USER ST    TIME NODES NODELIST(REASON)
200642   general Trinity UserID  R 3:51:29     1 xanadu-22
200637   general Trinity UserID  R 3:54:26     1 xanadu-21
200633   general Trinity UserID  R 3:55:51     1 xanadu-20
To monitor jobs in a particular partition:
$ squeue -p general
 JOBID PARTITION    NAME  USER ST       TIME NODES NODELIST(REASON)
287283   general bfo_111 User1  R   15:54:58     2 xanadu-[24-25]
203251   general  blastp User2  R 3-02:22:39     1 xanadu-23
203252   general  blastp User3  R 3-02:22:39     1 xanadu-23
Display information on a running/completed job
sacct can be used
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
288052             gmap    general pi-wegrzyn          1  COMPLETED      0:0
288052.batch      batch            pi-wegrzyn          1  COMPLETED      0:0
288775             gmap    general pi-wegrzyn          1    RUNNING      0:0
$ sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 288775
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode
------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
288775             gmap pi-wegrzyn    general                   1   00:02:55    RUNNING      0:0
To get more information about a specific job
scontrol can be used
$ scontrol show jobid 900001
JobId=900001 JobName=blast
   UserId=USER1(#####) GroupId=domain users(#####) MCS_label=N/A
   Priority=5361 Nice=0 Account=pi QOS=general
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=01:39:25 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-06-27T16:51:36 EligibleTime=2017-06-27T16:51:36
   StartTime=2017-06-27T16:51:36 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=general AllocNode:Sid=xanadu-submit-int:27120
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=xanadu-24 BatchHost=xanadu-24
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=blast.sh
   WorkDir=Plant
   StdErr=blast.err
   StdIn=/dev/null
   StdOut=blast.out
   Power=
How to cancel a job after submission:
If you need to stop a job which you have submitted, you can use the
scancel command with the JobID number:
$ scancel [jobID number]
To terminate all your jobs:
$ scancel -u [UserID]
The following are some commonly used directories and their purposes:
In order to transfer a data directory between clusters, it is good practice to tar the directory and generate an md5sum string for the tarred file before transferring. This string should match the md5sum value generated after the transfer; this ensures a complete and uncorrupted transfer of data. The steps are listed here:
(1) Tar the directory
data : Directory with data to be tarred
dataset.tar : Name of the output tarred file. It has to be specified.
$ tar -cvf dataset.tar data
(2) Create md5sum string
$ md5sum dataset.tar
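md5sum prints a 32-character hexadecimal checksum followed by the file name; record it so you can compare it after the transfer. Example output (the checksum value here is hypothetical):

d41d8cd98f00b204e9800998ecf8427e  dataset.tar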
(3) Transfer file
$ scp [source] [destination]
Case 1: when logged on to BBC and transferring the file to Xanadu
$ scp path/to/dataset.tar <user_name>@xanadu-submit-ext.cam.uchc.edu:$HOME
Case 2: when logged on to Xanadu and transferring the file from BBC (replace <bbc_host> with the BBC login host name)
$ scp <user_name>@<bbc_host>:path/to/dataset.tar $HOME
In both cases you will be prompted for a password; this is the password for the cluster on which you are not currently logged in.
After the transfer, check the md5sum of the transferred file at the destination.
$ md5sum dataset.tar
The value should match the string generated in step 2.
Untar the file
$ tar -xvf dataset.tar
The first point of contact for a user on the cluster is the head node (submit node). Any command given at the prompt is executed by the head node. This is not desirable, as the head node has a range of tasks to perform, and using it for computation will slow its performance. However, it is often convenient to run some commands on the command line rather than through a script. This can be achieved by initiating an "interactive session", which executes commands on the compute nodes.
How to Start an Interactive Session
Interactive sessions are only allowed through the internal submit node, xanadu-submit-int.cam.uchc.edu. Access to the internal submit node requires a UCHC VPN connection, which can be established with Pulse Secure:
- Open Pulse Secure
- Add a new connection
- Set the Server URL to: vpn.uchc.edu/cam
- Connect and log in with your CAM ID and password
Once a VPN connection is established, log in to the internal submit node (ssh <user_name>@xanadu-submit-int.cam.uchc.edu).
Once logged in, start an interactive session using srun.
To start a bash session:
$ srun --pty bash
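By default this starts a single-core bash session on a compute node in the default partition. You can also request specific resources for the session; a minimal sketch (the partition, core count, and memory values here are illustrative, not required):

$ srun --partition=general -c 4 --mem=8G --pty bash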
In the following example, srun executes /bin/hostname using three tasks (-n3) on a single node; the -l flag prefixes each line of output with the task ID.
$ srun -n3 -l /bin/hostname
2: xanadu-24.cam.uchc.edu
1: xanadu-24.cam.uchc.edu
0: xanadu-24.cam.uchc.edu