High Throughput GPU Cluster (HTGC)

General Information

The CS High Throughput GPU Cluster (HTGC) is an HTCondor cluster focused on GPU-related computational applications such as TensorFlow, OpenCV, and MATLAB. HTCondor is a job submission and queuing system that provides process-level parallelization for computationally intensive tasks. All CS staff and students with a valid CSLab UNIX account are eligible to use it.

The Cluster

Currently, the HTGC is composed of one job submission node and 19 job execution slots as shown below:

GPU                              7 x Nvidia V100   12 x Nvidia K80
Maximum memory size per process  128 GB            64 GB
O/S                              Ubuntu 16.04      Ubuntu 16.04
GPU Capability                   7.0               3.7
GPU Memory                       16 GB             12 GB
GPU CUDA Version                 9.0               9.0
CUDA Driver Version              9.2               9.2

The HTGC can be accessed with any Secure Shell (SSH) client by connecting to

htgc1.cs.cityu.edu.hk (within CS departmental network)

Please do not run jobs on this head node. Jobs running for longer than an hour will be killed without prior notice.

User Data

Besides users' home directories, all nodes in the HTGC mount a 12 TB shared NFS storage on the path '/public'. Users can create their own folders there. Each user account has a default quota of 200 GB of disk space in '/public'. There is no backup for files in '/public', and all files not accessed for 30 days will be removed.
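As a rough illustration of working within these rules, the sketch below creates a personal folder, reports its size so it can be compared with the 200 GB quota, and lists files whose last access is more than 30 days old. The folder name /public/$USER and the helper names are assumptions made for illustration. The source does not say how the 30-day clean-up is implemented, so the access-time check with find is only an approximation of it.

```shell
#!/bin/sh
# Hypothetical helpers; the folder layout under /public is an assumption.

# Print regular files under directory $1 whose last access was more than 30 days ago.
list_stale_files() {
    find "$1" -type f -atime +30 -print
}

# Print the total disk usage of directory $1 in human-readable form.
show_usage() {
    du -sh "$1"
}

# Assumed personal folder; adjust to whatever name you actually use.
my_dir="/public/${USER:-$(id -un)}"

# Only touch /public when it exists (i.e. when running on an HTGC node).
if [ -d /public ]; then
    mkdir -p "$my_dir"
    show_usage "$my_dir"
    list_stale_files "$my_dir"
fi
```

Files reported by list_stale_files are candidates for removal under the 30-day rule, so anything important among them should be copied elsewhere.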

Job submission script

To submit jobs to the HTGC, a submission script is needed. Below is a simple example. Create a file called

sample.condor

and put the following lines in it:


# normally a shell script
executable = myproc.sh
error      = myproc.err
log        = myproc.log

# command line arguments
arguments  = arg1 ...
# optional file for stdin
input      = arg1.in
# optional file for stdout
output     = arg1.out
# submit a single job
queue

# submit another job in the same script
executable = myproc2
# Process ID as argument
arguments  = $(Process) ...
# optional file depends on Process ID
input      = $(Process).in
output     = $(Process).out
# submit 4 jobs with Process ID 0..3
queue 4
.
.

 

where ‘myproc.sh’ is an ordinary shell script that can also be run in a normal SSH terminal session.
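For orientation, a job script of this kind might look like the sketch below. Its contents are an assumption for illustration and are not taken from the site's demo files: it reports the arguments it received, the host it runs on, and the CUDA_VISIBLE_DEVICES variable if the environment provides one, then calls nvidia-smi when that tool is present.

```shell
#!/bin/sh
# myproc.sh: illustrative job script (contents are an assumption, not the site's demo).

# Print a short report: the arguments received, the execution host,
# and the GPU(s) made visible to the job, if that variable is set.
report_job() {
    echo "arguments: $*"
    echo "host: $(hostname)"
    echo "visible GPUs: ${CUDA_VISIBLE_DEVICES:-not set}"
}

report_job "$@"

# Show GPU status when nvidia-smi is installed on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
fi

# The real work would go here, e.g. a hypothetical: python3 train.py "$@"
```

The script would typically need to be executable (chmod +x myproc.sh). Anything it prints ends up in the file named by the output line of the submission script.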
To submit jobs, use the condor_submit command, for example:

# condor_submit sample.condor
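The second part of the sample script reads each job's stdin from $(Process).in, so with 'queue 4' the files 0.in to 3.in are expected in the submission directory. A minimal sketch for generating them is shown here; the function name make_inputs and the placeholder file contents are purely illustrative.

```shell
#!/bin/sh
# Create N placeholder stdin files named 0.in .. (N-1).in in directory $2
# (default: current directory), matching the $(Process) values of "queue N".
make_inputs() {
    n=$1
    dir=${2:-.}
    i=0
    while [ "$i" -lt "$n" ]; do
        # Placeholder content; real jobs would write their actual input here.
        echo "input for process $i" > "$dir/$i.in"
        i=$((i + 1))
    done
}
```

Calling 'make_inputs 4' in the directory that holds sample.condor would create 0.in, 1.in, 2.in and 3.in.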

No matter how many jobs are submitted, each user can have at most 5 jobs executed at the same time.

Sample condor demo files can be found at /public/condor_demo

Frequently used HTCondor commands

Job submission:   /usr/bin/condor_submit
Job enquiry:      /usr/bin/condor_q
Job removal:      /usr/bin/condor_rm {Job ID}
HTCondor Status:  /usr/bin/condor_status
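The commands above can be chained into a small workflow. The sketch below assumes that condor_submit prints its usual summary line of the form "1 job(s) submitted to cluster 42." and extracts the cluster number from it; the helper name cluster_id is hypothetical, and the workflow part runs only where HTCondor and a sample.condor file are available.

```shell
#!/bin/sh
# Read condor_submit output on stdin and print the cluster (job) ID.
# Assumes the usual summary line, e.g. "1 job(s) submitted to cluster 42."
cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Run the workflow only where HTCondor is installed and the sample script exists.
if command -v condor_submit >/dev/null 2>&1 && [ -f sample.condor ]; then
    id=$(condor_submit sample.condor | cluster_id)
    echo "submitted cluster $id"
    # Show the queued and running jobs of that cluster.
    condor_q "$id"
    # Uncomment to remove all jobs of that cluster.
    # condor_rm "$id"
fi
```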

For detailed HTCondor references, please refer to the link http://research.cs.wisc.edu/htcondor/

For any queries, please send email to support[at]cs.cityu.edu.hk