This compute cluster is restricted to Dr. Taufer’s students.
It was reinstalled in Fall 2017 with Ubuntu 16.04.
It is available as geronimo.gcl.cis.udel.edu.
Machine | Features |
---|---|
geronimo | head node |
node0 | 4x C2050 |
node1 | 4x C2050 |
node2 | 4x K40 |
The stock MPI did not play nicely with SLURM on the cluster, so a custom MPI build that works better was installed.
You can start using it with the `module` command: `module load openmpi`.
Use `module avail` to see which versions are available. 3.0.0 was installed initially, but more may be added in the future.
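For example (assuming the bash shell; module names and versions other than openmpi 3.0.0 may differ):

```bash
# See which modules and versions are installed
module avail

# Load the custom Open MPI build
module load openmpi

# Confirm which MPI is now on the PATH
which mpirun
mpirun --version
```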
Nvidia CUDA 8.0 and 9.0 are installed under `/usr/local/cuda-8.0` and `/usr/local/cuda-9.0`, respectively.
`/usr/local/cuda` is a symlink which points to one of the installed versions. This is known as the "default" CUDA.
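To check which installation the `/usr/local/cuda` symlink currently points to:

```bash
# Resolve the default CUDA symlink (e.g. /usr/local/cuda-9.0)
readlink -f /usr/local/cuda
```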
Use of the SLURM workload manager/job queueing system is recommended.
The following partitions are available:
Partition | Purpose |
---|---|
geronimo | default partition, get whatever node is available |
C2050 | request node(s) that contain C2050 GPUs |
K40 | request the node that contains K40 GPUs |
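To see the partitions and the current state of their nodes, the standard SLURM commands work, for example:

```bash
# List all partitions and node states
sinfo

# Limit the listing to a single partition
sinfo -p K40
```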
Make sure that `/usr/local/cuda/bin` is added to your `PATH` (or `/usr/local/cuda-$VERSION/bin` if you want a non-default CUDA version).
Then you will have, for instance, `nvcc`.
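For example, in your shell startup file (assuming bash, e.g. `~/.bashrc`):

```bash
# Put the default CUDA toolkit on the PATH
export PATH=/usr/local/cuda/bin:$PATH

# Verify that the compiler is found
nvcc --version
```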
Some software (such as the Python package TensorFlow) requires CUDA libraries to be available at run time.
Make sure that `/usr/local/cuda/lib64` is added to your `LD_LIBRARY_PATH` environment variable.
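Again assuming bash, a minimal sketch:

```bash
# Make the CUDA runtime libraries visible at run time
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```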
GPUs are exposed by SLURM as a Generic Resource (GRES).
Example (using 2 GPUs): `srun -p geronimo --gres=gpu:2 ./myjob`
Example (using 4 K40s and getting an interactive shell to run other jobs): `srun -p K40 --gres=gpu:4 --pty bash`
See the SLURM documentation on Generic Resources for more.
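For non-interactive work, the same resources can be requested in a batch script submitted with `sbatch`. A minimal sketch (the job name, output file, and `./myjob` executable are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test          # placeholder job name
#SBATCH --partition=K40              # or geronimo / C2050
#SBATCH --gres=gpu:2                 # number of GPUs to request
#SBATCH --output=gpu-test-%j.out     # %j expands to the job ID

# Load the custom Open MPI build and run the placeholder executable
module load openmpi
srun ./myjob
```

Submit with `sbatch myjob.sh` and check its status with `squeue`.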
The older motherboards in this cluster can only address up to 4 GB of GPU memory, whereas the K40 GPUs contain 12 GB! The typical outcome is that the K40s are not recognized at all. As a workaround, the Linux kernel is booted with a flag that allows just two of the four K40s to be detected.
These two K40s are usable, but node2 may occasionally need to be power cycled.
A script, `node2-power.sh`, is provided so researchers can reboot the node without system administrator intervention.
`node2-power.sh cycle` will power cycle node2. Run it with no arguments for more detailed usage.
The node should automatically be added back to SLURM after it boots back up.
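You can confirm that node2 has rejoined the cluster with standard SLURM commands, for example:

```bash
# node2 should return to the idle state once it is back up
sinfo -p K40
scontrol show node node2
```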