High Throughput GPU Cluster 2 (HTGC2)
General Information
The CS High Throughput GPU Cluster 2 (HTGC2) is an HTCondor (version 8.6.8) cluster that focuses on single-precision GPU computational applications such as TensorFlow, Keras, Torch and MATLAB. HTCondor is a job submission and queuing system that provides process-level parallelization for computationally intensive tasks. All CS staff and students with a valid CSLab UNIX account are eligible to use it.
The Cluster
Currently, the HTGC2 is composed of one job submission node and 16 job execution slots, as shown below:

GPU                         | 16 x RTX 2080 Ti
Maximum memory size per job | 128 GB
O/S                         | Ubuntu 18.04
GPU Memory                  | 11 GB
CUDA Runtime Version        | 10.0
CUDA Driver Version         | 10.2
CUDA Capability             | 7.5
CUDA Device Name            | GeForce RTX 2080 Ti
The HTGC2 can be accessed by any Secure Shell (SSH) client connecting to
htgc2.cs.cityu.edu.hk (within the CS departmental network)
which is the job submission node and is not equipped with a GPU card. This cluster supports only Python 3 for AI/learning applications such as TensorFlow and Keras.
To compile and test code, please log on to
htgc2t.cs.cityu.edu.hk
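For example, an interactive session might begin as follows ('username' is a placeholder for your CSLab UNIX account; the '# ' prefix is the shell prompt, as in the commands later on this page):

```text
# ssh username@htgc2.cs.cityu.edu.hk     (submit jobs; no GPU on this node)
# ssh username@htgc2t.cs.cityu.edu.hk    (compile and test code)
```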
User Data
Besides users' home directories, all nodes in the HTGC2 share NFS storage with other clusters on the path '/public'. Users can create their own folders there if they have not yet done so. Each user account has a default quota of 200 GB of disk space in '/public'. There is no backup for files in '/public', and any file not accessed for 30 days will be removed.
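As a sketch, creating your own folder and checking its usage might look like the following. On HTGC2 the shared path is /public; here it defaults to a temporary directory so the commands can be tried anywhere, which is an assumption for illustration only.

```shell
# On HTGC2 this would be PUBLIC=/public; the default below is only so
# the sketch can be dry-run on any machine.
PUBLIC=${PUBLIC:-/tmp/public_demo}
ME=${USER:-$(id -un)}
mkdir -p "$PUBLIC/$ME"    # create your own folder if you have not done so
du -sh "$PUBLIC/$ME"      # files here count against the 200 GB quota
```

Remember that files under /public are not backed up, so keep a copy of anything important in your home directory.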
Job submission script
To submit jobs to the HTGC2, an HTCondor submission script is needed. Below is a sample text file called
myjob.condor
which has the following contents:
executable = myproc.sh       # normally a shell script
requirements = (CUDADeviceName == "GeForce RTX 2080 Ti")   # optional parameter
error = myproc.err
log = myproc.log
arguments = arg1 ...         # command line arguments for myproc.sh
input = arg1.in              # optional file for stdin
output = arg1.out            # optional file for stdout
queue                        # submit a single job

executable = myproc2         # submit another job in the same script
arguments = $(Process) ...   # Process ID as argument
input = $(Process).in        # optional file depends on Process ID
output = $(Process).out
queue 4                      # submit 4 jobs with Process ID 0..3
where ‘myproc.sh’ is a normal executable shell script which can be run under a normal ssh terminal session.
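As an illustration, a minimal 'myproc.sh' might look like the sketch below. The names here ('train.py' in particular) are placeholders, not part of the cluster setup.

```shell
#!/bin/bash
# Hypothetical myproc.sh job script for HTGC2 (a sketch, not the
# cluster's own example). HTCondor runs this on an execute slot; the
# submit file's "arguments" line arrives as $1, $2, ..., and the
# "input"/"output" files are wired to stdin/stdout.
echo "job started on $(hostname) at $(date)"
echo "arguments: $*"
# The real workload would go here, for example:
#   python3 train.py "$@"    # train.py is a placeholder name
echo "job finished at $(date)"
```

Make sure the script is executable (chmod +x myproc.sh) and that it runs correctly in an ordinary SSH session on htgc2t before submitting it.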
To submit jobs, simply use the condor_submit command like
# condor_submit myjob.condor
No matter how many jobs are submitted, each user can have at most 5 jobs executed concurrently.
To test a job, please submit it using the '-batch-name Test' option:
# condor_submit -batch-name Test myjob.condor
Test jobs will be terminated after running for 10 minutes.
Sample condor demo files can be found at /public/condor_demo
Frequently used HTCondor commands
Job submission:  /usr/bin/condor_submit
Job enquiry:     /usr/bin/condor_q
Job removal:     /usr/bin/condor_rm {Job ID}
HTCondor status: /usr/bin/condor_status
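Put together, a typical session might look like the following (the job ID 23.0 is a made-up example; yours will differ):

```text
# condor_submit myjob.condor     (queue the jobs described in myjob.condor)
# condor_q                       (check the state of your queued jobs)
# condor_rm 23.0                 (remove the job with ID 23.0 if necessary)
# condor_status                  (view the state of the execution slots)
```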
For detailed HTCondor references, please refer to the link https://research.cs.wisc.edu/htcondor/manual/v8.6/Contents.html
For any queries, please send an email to support[at]cs.cityu.edu.hk