High Throughput GPU Cluster 1 (HTGC1)

General Information

The CS High Throughput GPU Cluster (HTGC) is an HTCondor cluster that focuses on GPU-related computational applications such as TensorFlow, OpenCV, and MATLAB. HTCondor is a job submission and queuing system that provides process-level parallelization for computationally intensive tasks. All CS staff and students with a valid CSLab UNIX account are eligible to use it.

The Cluster

Currently, the HTGC is composed of one job submission node and 7 job execution slots as shown below:

GPU:                              7 x Nvidia V100
Maximum memory size per process:  64 GB
O/S:                              Ubuntu 20.04
GPU Memory:                       16 GB
CUDA Runtime Version:             11.2
CUDA Driver Version:              11.6
CUDA Capability:                  7.0
CUDA Device Name:                 Tesla V100-SXM2-16GB

The HTGC can be accessed with any Secure Shell (SSH) client by connecting to

htgc1.cs.cityu.edu.hk (within CS departmental network)

To compile and test code, please log on to

htgc1t.cs.cityu.edu.hk

User Data

In addition to users' home directories, all nodes in the HTGC mount shared NFS storage at the path '/public'. Users can create their own folders there. Each user account has a default quota of 200 GB of disk space in '/public'. Files in '/public' are not backed up, and any file that has not been accessed for 30 days will be removed.
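
Because files in '/public' are removed after 30 days without access, it can be useful to check which files are close to that limit. The following is a minimal sketch, assuming that find is available on the nodes and that access times are recorded on the NFS mount. The function name list_stale_files and the folder name /public/$USER are hypothetical and should be adapted to your own folder.

```shell
#!/bin/sh
# list_stale_files DIR [DAYS]
# Print regular files under DIR whose last access is at least DAYS days old.
# DAYS defaults to 30, the removal threshold stated for /public.
# Hypothetical helper, not part of the cluster's own tooling.
list_stale_files() {
    dir=$1
    days=${2:-30}
    # "-atime +N" matches files accessed more than N whole days ago,
    # so DAYS-1 is used to include files that are exactly DAYS days old.
    find "$dir" -type f -atime +"$((days - 1))" -print
}

# Example (assumed folder name):
# list_stale_files "/public/$USER"
```

Listing files with find does not read their contents, so the check itself should not reset the access times it reports.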

Job submission script

To submit jobs to the HTGC, a submission script is needed. Below is a simple example. Create a file called

sample.condor

and put the following lines in it:

# normally a shell script
executable = myproc.sh
# optional parameter
requirements = (CUDADeviceName == "Tesla V100-SXM2-16GB")
error      = myproc.err
log        = myproc.log

# command line arguments
arguments  = arg1 ...
# optional file for stdin
input      = arg1.in
# optional file for stdout
output     = arg1.out
# submit a single job
queue

# submit another job in the same script
executable = myproc2
# Process ID as argument
arguments  = $(Process) ...
# optional file depends on Process ID
input      = $(Process).in
output     = $(Process).out
# submit 4 jobs with Process ID 0..3
queue 4
.
.

 

where ‘myproc.sh’ is an ordinary shell script that can also be run in a normal SSH terminal session.
To submit jobs, use the condor_submit command, for example:

$ condor_submit sample.condor
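
As an illustration of what the ‘myproc.sh’ executable might contain, the following is a minimal sketch; it is not the script provided in the demo files. It assumes that nvidia-smi is on the PATH of the execution node. The function name report_environment and the script name train.py are hypothetical. Whatever the script prints to stdout ends up in the file named by 'output', and error messages go to the file named by 'error'.

```shell
#!/bin/sh
# Hypothetical example of a job script such as myproc.sh.

# Print basic information about the node the job is running on.
report_environment() {
    echo "Host: $(uname -n)"
    echo "Arguments: $*"
    # nvidia-smi is assumed to be installed on the GPU nodes.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
    else
        echo "nvidia-smi not found; no GPU information available" >&2
    fi
}

report_environment "$@"

# Replace the line below with the real workload (train.py is a placeholder name).
# python3 train.py "$@"
```

Running the script once by hand on the test node before submitting it is a quick way to catch path and permission problems.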

Regardless of how many jobs are submitted, each user can have at most 5 jobs running at the same time.

Sample condor demo files can be found at /public/condor_demo
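
The second part of the sample submission script reads standard input from '$(Process).in', so the files 0.in to 3.in should exist before 'queue 4' is submitted. A minimal sketch of preparing them is given below. The function name make_inputs and the placeholder file contents are assumptions for illustration only.

```shell
#!/bin/sh
# make_inputs COUNT [DIR]
# Create COUNT input files named 0.in, 1.in, ... in DIR (default: current
# directory), matching the $(Process).in pattern used with "queue COUNT".
# Hypothetical helper; the file contents are placeholders.
make_inputs() {
    count=$1
    dir=${2:-.}
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "input for process $i" > "$dir/$i.in"
        i=$((i + 1))
    done
}

# Example: prepare 0.in .. 3.in for "queue 4"
# make_inputs 4
```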

Frequently used HTCondor commands

Job submission:   /usr/bin/condor_submit
Job enquiry:      /usr/bin/condor_q -nobatch
Job removal:      /usr/bin/condor_rm {Job ID}
HTCondor Status:  /usr/bin/condor_status
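
A typical session with these commands might look like the sketch below. The helper parse_cluster_id is hypothetical. It assumes that condor_submit reports a new submission with a line of the form "N job(s) submitted to cluster M.", which should be checked against the actual output on the HTGC.

```shell
#!/bin/sh
# parse_cluster_id: read condor_submit output on stdin and print the cluster number.
# Assumes a line such as "4 job(s) submitted to cluster 123." (hypothetical helper).
parse_cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Example session, commented out so that nothing is submitted by accident:
# cluster=$(/usr/bin/condor_submit sample.condor | parse_cluster_id)
# /usr/bin/condor_q -nobatch "$cluster"    # list the jobs of this submission
# /usr/bin/condor_rm "$cluster"            # remove all jobs of this submission
```

Passing the cluster number to condor_q and condor_rm limits both commands to the jobs from that one submission.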

For detailed HTCondor references, please refer to the link http://research.cs.wisc.edu/htcondor/

For any queries, please send an email to support[at]cs.cityu.edu.hk