High Throughput GPU Cluster 3 (HTGC3)

General Information

The CS High Throughput GPU Cluster 3 (HTGC3) is an HTCondor cluster that focuses on GPU-related computational applications such as TensorFlow, OpenCV, and MATLAB. HTCondor is a job submission and queuing system that provides process-level parallelization for computationally intensive tasks. All CS staff and students with a valid CSLab UNIX account are eligible to use it.

The Cluster

Currently, the HTGC3 is composed of one job submission node and 10 job execution nodes as shown below:

GPU	2+2+4 Nvidia V100	7 x Nvidia V100S
Maximum memory size per process	256GB or 512GB	128GB
O/S	Ubuntu 18.04	Ubuntu 18.04
GPU Memory	32 GB	32 GB
CUDA Runtime Version	10.1	10.1
CUDA Driver Version	11.0	11.0
CUDA Capability	7.0	7.0
CUDA Device Name	Tesla V100-SXM2-32GB	Tesla V100S-PCIE-32GB
Nodes	two dual-GPU nodes and one quad-GPU node	7 single-GPU nodes

The HTGC3 can be accessed with any Secure Shell (SSH) client by connecting to

htgc3.cs.cityu.edu.hk (within CS departmental network)

Please do not run jobs on the submission node. Jobs running for longer than an hour will be killed without prior notice.

To compile and test code, please log on to

htgc3t.cs.cityu.edu.hk

User Data

In addition to users' home directories, all nodes in the HTGC3 mount shared NFS storage at '/public', which is also shared with other clusters and the gateway server. Users can create their own folders there. Each user account has a default quota of 200GB of disk space in '/public'. Files in '/public' are not backed up, and any file not accessed for 30 days will be removed.
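
Because files in '/public' are removed after 30 days without access, it can help to check which files are getting close to that limit. The following is a minimal sketch, not an official tool: the function names and the 21-day default are assumptions chosen for illustration, and it relies only on the standard find and du commands.

```shell
#!/bin/sh
# Hypothetical helpers for keeping an eye on a folder under /public.

# List regular files whose last access is more than DAYS days old.
# Usage: list_stale_files DIR [DAYS]   (DAYS defaults to 21, an arbitrary
# early-warning value below the 30-day removal limit)
list_stale_files() {
    dir=$1
    days=${2:-21}
    # find ignores fractional days, so "-atime +21" matches files last
    # accessed at least 22 full days ago.
    find "$dir" -type f -atime +"$days" -print
}

# Print the total space used by DIR, to compare against the 200GB quota.
show_usage() {
    du -sh "$1"
}
```

For example, if your folder is '/public/myfolder' (an assumed path), run "list_stale_files /public/myfolder" to see files at risk and "show_usage /public/myfolder" to see the space used.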

Job submission script

To submit jobs to the HTGC3, a submission script is needed. As a simple example, create a file called

sample.condor

and put the following lines in it:


# Comments must be on their own lines; HTCondor does not strip trailing '#' text.
# normally a shell script
executable = myproc.sh
# optional: restrict the job to nodes with this GPU model
requirements = (CUDADeviceName == "Tesla V100-SXM2-32GB")
# optional: request up to 4 GPUs for a job
request_GPUs = 2

error      = myproc.err
log        = myproc.log

# command line arguments
arguments  = arg1 ...
# optional file for stdin
input      = arg1.in
# optional file for stdout
output     = arg1.out
# submit a single job
queue

# submit another job in the same script
executable = myproc2
# Process ID as argument
arguments  = $(Process) ...
# optional input file that depends on the Process ID
input      = $(Process).in
output     = $(Process).out
# submit 4 jobs with Process ID 0..3
queue 4
.
.

where ‘myproc.sh’ is a normal shell script which can be run under normal ssh terminal sessions.
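
As an illustration of what such a script might contain, here is a minimal sketch. Its contents are an assumption rather than the cluster's own example: the function name report_job_env and the commented-out train.py are hypothetical. It records where the job ran and which GPUs it was given, which is often useful when reading the output files later. HTCondor typically sets CUDA_VISIBLE_DEVICES for GPU jobs, but the script prints a placeholder if the variable is unset.

```shell
#!/bin/sh
# myproc.sh: hypothetical job wrapper script (a sketch, not the official demo).

# Print the execution host, the GPUs assigned to the job, and the arguments.
report_job_env() {
    echo "host: $(uname -n)"
    # CUDA_VISIBLE_DEVICES is assumed to be set by the scheduler for GPU jobs.
    echo "gpus: ${CUDA_VISIBLE_DEVICES:-unset}"
    echo "args: $*"
}

report_job_env "$@"

# Replace the line below with the real workload (hypothetical program name):
# python3 train.py "$@"
```

The script can be tried in an ordinary terminal session first, for example "sh myproc.sh arg1", before it is named in the submit file.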
To submit jobs, simply use the condor_submit command like

# condor_submit sample.condor

No matter how many jobs are submitted, each user can have at most 6 jobs executed at the same time.

Jobs that do not specify "request_GPUs" will be run on single-GPU and dual-GPU nodes. Jobs that run on dual-GPU nodes count double towards resource usage. To restrict jobs to single-GPU nodes, please specify "request_GPUs=1" in the condor submit file.
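
A submit file that applies this restriction could look like the sketch below. The function name single_gpu_submit and the file names single.out, single.err, and single.log are assumptions made for illustration; only the "request_GPUs = 1" line comes from the rule described above.

```shell
#!/bin/sh
# Print a minimal submit description that asks for exactly one GPU.
# All file names below are hypothetical placeholders.
single_gpu_submit() {
    cat <<'EOF'
# restrict the job to one GPU
executable = myproc.sh
request_GPUs = 1
output = single.out
error = single.err
log = single.log
queue
EOF
}
```

To use it, write the output to a file with "single_gpu_submit > single_gpu.condor" and submit that file with condor_submit on the submission node.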

Sample condor demo files can be found at /public/condor_demo

Frequently used HTCondor commands

Job submission:   /usr/bin/condor_submit
Job enquiry:      /usr/bin/condor_q
Job removal:      /usr/bin/condor_rm {Job ID}
HTCondor Status:  /usr/bin/condor_status
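
A typical session using these commands might look like the sketch below. The wrapper function run_condor is a hypothetical convenience that only checks whether the command is installed before running it, and the job ID 1234.0 is a made-up example.

```shell
#!/bin/sh
# Run an HTCondor command only if it is installed; otherwise print a hint
# to stderr and return 1. (Hypothetical helper, not part of HTCondor.)
run_condor() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "$1 not found: run this on the submission node" >&2
        return 1
    fi
}

# Typical session (1234.0 is a made-up job ID):
# run_condor condor_submit sample.condor   # queue the jobs
# run_condor condor_q                      # list your jobs and their states
# run_condor condor_rm 1234.0              # remove one job by its ID
# run_condor condor_status                 # show the state of the cluster
```

On the submission node the commands can of course be typed directly; the wrapper only makes the sketch safe to run on a machine without HTCondor.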

For detailed HTCondor references, please refer to the link http://research.cs.wisc.edu/htcondor/

For any queries, please send an email to support[at]cs.cityu.edu.hk