This compute cluster is restricted to Dr. Taufer’s students.
It was reinstalled in Fall 2017 with Ubuntu 16.04.
It is available as
The stock MPI did not work well with SLURM on the cluster, so a custom MPI build that works better was installed.
You can use the module command to start using it:
module load openmpi
Use module avail to see which versions are available. Version 3.0.0 was installed initially, but more may be added in the future.
Nvidia CUDA 8.0 and 9.0 are installed under
/usr/local/cuda is a symlink that points to one of the installed versions. This is known as the "default" CUDA.
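To check which installed version the default symlink currently points to, resolving the link is usually enough. The following is a minimal sketch, assuming GNU readlink (as shipped with Ubuntu); show_cuda_default is a hypothetical helper name, not something provided on the cluster.

```shell
# Hypothetical helper: print the directory a symlink resolves to.
show_cuda_default() {
    # $1 = symlink to resolve (defaults to the default CUDA path)
    readlink -f "${1:-/usr/local/cuda}"
}

# Print the resolved path, or a note if the path cannot be resolved.
show_cuda_default || echo "/usr/local/cuda could not be resolved" >&2
```

Passing a different path as the first argument resolves that link instead.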
Use of the SLURM workload manager/job queueing system is recommended.
The following partitions are available:
|geronimo||default partition, get whatever node is available|
|C2050||request node(s) that contain C2050 GPUs|
|K40||request the node that contains K40 GPUs|
Make sure that /usr/local/cuda/bin is added to your PATH environment variable (or /usr/local/cuda-$VERSION/bin if you want a non-default CUDA version).
Then you will have, for instance,
Some software (such as the Python package TensorFlow) requires the CUDA libraries to be available at run time.
Make sure that /usr/local/cuda/lib64 is added to your LD_LIBRARY_PATH environment variable.
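One way to make these settings persistent is to put them in a shell startup file such as ~/.bashrc. The sketch below is only an illustration: prepend_dir is a hypothetical helper (not provided on the cluster) that adds a directory to a colon-separated variable unless it is already there, and CUDA_HOME is a name chosen here for convenience.

```shell
# Hypothetical helper: prepend a directory to a colon-separated
# environment variable, skipping it if it is already present.
prepend_dir() {
    # $1 = name of the variable, $2 = directory to prepend
    eval "_pd_current=\${$1:-}"
    case ":${_pd_current}:" in
        *":$2:"*) ;;   # already listed, nothing to do
        *) eval "$1=\"\$2\${_pd_current:+:\$_pd_current}\""
           export "$1" ;;
    esac
}

# Default CUDA; use /usr/local/cuda-$VERSION for a non-default version.
CUDA_HOME=/usr/local/cuda

prepend_dir PATH "$CUDA_HOME/bin"
prepend_dir LD_LIBRARY_PATH "$CUDA_HOME/lib64"
```

Because the helper checks for an existing entry first, sourcing the file repeatedly does not keep growing either variable.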
GPUs are exposed by SLURM as a Generic Resource (GRES).
Example (using 2 GPUs):
srun -p geronimo --gres=gpu:2 ./myjob
Example (using 4 K40s and getting a shell to run other jobs):
srun -p K40 --gres=gpu:4 --pty bash
See the SLURM documentation on Generic Resources (GRES) for more.
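For non-interactive work, the same GPU request can go into a batch script. The following is a minimal sketch, not a site-provided template: the file name gpu-job.sh, the job name, and the output file pattern are made up for illustration, and ./myjob is the placeholder program from the srun example. It also assumes that SLURM on this cluster sets CUDA_VISIBLE_DEVICES for jobs that request GPUs, which depends on how GRES is configured.

```shell
# Write a small batch script; the name gpu-job.sh is arbitrary.
cat > gpu-job.sh <<'EOF'
#!/bin/bash
# Hypothetical job name
#SBATCH --job-name=gpu-example
# Default partition; use C2050 or K40 to target specific GPUs
#SBATCH --partition=geronimo
# Request 2 GPUs, as in the srun example
#SBATCH --gres=gpu:2
# %j expands to the job ID
#SBATCH --output=gpu-example-%j.out

# Show which GPUs SLURM assigned (assumes GRES sets this variable).
echo "Assigned GPUs: ${CUDA_VISIBLE_DEVICES:-unknown}"

srun ./myjob
EOF

# Submit it on the cluster with:
#   sbatch gpu-job.sh
```

Keeping the request in a file makes it easy to rerun the same job or to change the partition and GPU count in one place.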
The older motherboards in this cluster can only address up to 4 GB of GPU memory, whereas each K40 GPU has 12 GB. The typical outcome is that the K40s are not recognized at all. As a workaround, the Linux kernel is booted with a flag that allows just two of the four K40s to be detected.
These two K40s are usable, but:
A script is provided so researchers can reboot the node without system administrator input:
node2-power.sh cycle will power cycle node2. Run it with no arguments to get more detailed usage.
The node should automatically be added back to SLURM after it boots back up.