This compute cluster is restricted to Dr. Taufer’s students.
It was reinstalled in Fall 2017 with Ubuntu 16.04.
It is available as geronimo.gcl.cis.udel.edu.
Machine | Features |
---|---|
geronimo | head node |
node0 | 4x C2050 |
node1 | 4x C2050 |
node2 | 4x K40 |
The stock MPI did not play nicely with SLURM on the cluster, so a custom MPI build that works better was installed.
You can start using it with the `module` command: `module load openmpi`.
Use `module avail` to see which versions are available. 3.0.0 was installed initially, but more may be added in the future.
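For example (assuming the bash shell; module names and versions other than openmpi 3.0.0 may differ):

```bash
# See which modules and versions are installed
module avail

# Load the custom Open MPI build
module load openmpi

# Confirm which MPI is now on the PATH
which mpirun
mpirun --version
```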
Nvidia CUDA 8.0 and 9.0 are installed under `/usr/local/cuda-8.0` and `/usr/local/cuda-9.0`, respectively.
`/usr/local/cuda` is a symlink which points to one of the installed versions. This is known as the "default" CUDA.
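To check which installation the `/usr/local/cuda` symlink currently points to:

```bash
# Resolve the default CUDA symlink (e.g. /usr/local/cuda-9.0)
readlink -f /usr/local/cuda
```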
Use of the SLURM workload manager/job queueing system is recommended.
The following partitions are available:
Partition | Purpose |
---|---|
geronimo | default partition, get whatever node is available |
C2050 | request node(s) that contain C2050 GPUs |
K40 | request the node that contains K40 GPUs |
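To see the partitions and the current state of their nodes, the standard SLURM commands work, for example:

```bash
# List all partitions and node states
sinfo

# Limit the listing to a single partition
sinfo -p K40
```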
Make sure that `/usr/local/cuda/bin` is added to your `PATH` (or `/usr/local/cuda-$VERSION/bin` if you want a non-default CUDA version).
Then you will have, for instance, `nvcc`.
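For example, in your shell startup file (assuming bash, e.g. `~/.bashrc`):

```bash
# Put the default CUDA toolkit on the PATH
export PATH=/usr/local/cuda/bin:$PATH

# Verify that the compiler is found
nvcc --version
```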
Some software (such as the Python package TensorFlow) requires CUDA libraries to be available at run time.
Make sure that `/usr/local/cuda/lib64` is added to your `LD_LIBRARY_PATH` environment variable.
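Again assuming bash, a minimal sketch:

```bash
# Make the CUDA runtime libraries visible at run time
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```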
GPUs are exposed by SLURM as a Generic Resource (GRES).
Example (using 2 GPUs): `srun -p geronimo --gres=gpu:2 ./myjob`
Example (using 4 K40s and getting an interactive shell to run other jobs): `srun -p K40 --gres=gpu:4 --pty bash`
See the SLURM documentation on Generic Resources for more.
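For non-interactive work, the same resources can be requested in a batch script submitted with `sbatch`. A minimal sketch (the job name, output file, and `./myjob` executable are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test          # placeholder job name
#SBATCH --partition=K40              # or geronimo / C2050
#SBATCH --gres=gpu:2                 # number of GPUs to request
#SBATCH --output=gpu-test-%j.out     # %j expands to the job ID

# Load the custom Open MPI build and run the placeholder executable
module load openmpi
srun ./myjob
```

Submit with `sbatch myjob.sh` and check its status with `squeue`.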
The older motherboards in this cluster can only address up to 4 GB of GPU memory, whereas the K40 GPUs contain 12 GB! The typical outcome is that the K40s are not recognized at all. As a workaround, the Linux kernel is booted with a flag that allows just two of the four K40s to be detected.
These two K40s are usable, but node2 may occasionally need to be power cycled.
A script, `node2-power.sh`, is provided so researchers can reboot the node without system administrator intervention.
`node2-power.sh cycle` will power cycle node2. Run it with no arguments for more detailed usage.
The node should automatically be added back to SLURM after it boots back up.
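You can confirm that node2 has rejoined the cluster with standard SLURM commands, for example:

```bash
# node2 should return to the idle state once it is back up
sinfo -p K40
scontrol show node node2
```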