8 Admin Tasks

In the following, it is assumed that $CHROOT resolves to /opt/ohpc/admin/images/<version>.

8.1 Warewulf

Warewulf is used for cluster management.

8.1.1 Sharing of directories between `edi` and nodes

On edi, add the directory to /etc/exports:

# /home *(rw,no_subtree_check,fsid=10,no_root_squash)
/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)
/opt/spack *(rw,no_subtree_check,no_root_squash)
/opt/R *(rw,no_subtree_check,no_root_squash)

Then run exportfs -a to apply the exports.

For the nodes, add the corresponding entries to $CHROOT/etc/fstab, for example:

# edi
192.168.1.1:/opt/spack /opt/spack nfs nfsvers=3,nodev 0 0

192.168.1.1:/opt/R /opt/R nfs nfsvers=3,nodev 0 0

# mars
10.232.16.12:/mnt/vol2/hpc/edi        /home   nfs     rw,sync,user,hard,intr,_netdev,exec     0       0
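Because $CHROOT/etc/fstab is edited by hand, the same entry can easily end up in the file twice. The following is a minimal sketch of an idempotent append. The helper name ensure_line is hypothetical and not part of Warewulf or OpenHPC.

```shell
# ensure_line FILE LINE: append LINE to FILE unless an identical line
# is already present. Hypothetical helper, for illustration only.
ensure_line() {
    local file="$1" line="$2"
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Assumed usage (CHROOT set as described at the top of this chapter):
# ensure_line "$CHROOT/etc/fstab" "192.168.1.1:/opt/spack /opt/spack nfs nfsvers=3,nodev 0 0"
```

Running the helper twice with the same arguments leaves a single entry in the file.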

8.1.2 `renv` cache

The renv cache is mapped centrally to /opt/R/renv in RSW. To share the RSW cache and edi cache with the nodes, an NFS share has been added. See the previous section for more details.

8.1.3 Enable systemd services on nodes

pdsh -w c[0-5] systemctl <command>

8.1.4 Enable systemd service in image

export CHROOT=<some path>
chroot $CHROOT systemctl enable <service>

8.1.5 Updating image nodes

/root/update-nodes.sh

Sometimes munge does not start after updating the nodes, which leaves the nodes out of sync with the controller. Check systemctl status munge and, if necessary, restart munge on all nodes:

pdsh -w c[0-5] mkdir -p /var/log/munge
pdsh -w c[0-5] chown -R munge:munge /var/log/munge
pdsh -w c[0-5] systemctl restart munge

scontrol update nodename=c[0-5] state=resume
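To find out which nodes actually need the restart, the output of systemctl is-active can be filtered. This is only a sketch: the function name inactive_nodes is hypothetical, and it assumes that pdsh prefixes each output line with the node name followed by a colon and a space.

```shell
# inactive_nodes: read pdsh output ("node: state") on stdin and print the
# names of nodes whose reported state is not exactly "active".
inactive_nodes() {
    awk -F': ' '$2 != "active" { print $1 }'
}

# Assumed usage on edi (stderr is discarded to hide pdsh's notices about
# non-zero remote exit codes):
# pdsh -w c[0-5] systemctl is-active munge 2>/dev/null | inactive_nodes
```

An empty result means munge reports "active" on every node that answered.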

In addition, /opt/R/renv should be readable and writable by everyone. This is sometimes not the case either, which causes problems with renv.

pdsh -w c[0-5] chmod -R 777 /opt/R/renv

8.1.6 Sharing hostnames

The nodes must be aware of the RSW hostname and internal IP (docker0 gateway). To do so, add a hostname/IP mapping into $CHROOT/etc/hosts and reboot the nodes.

192.168.1.1 edi
172.18.0.3 rsw
172.18.0.4 rsw-docker
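Before rebooting the nodes, it can help to confirm that the image's hosts file really contains the expected mapping. A minimal sketch, assuming the usual hosts format (IP first, then one or more names); the function name hosts_ip is hypothetical.

```shell
# hosts_ip FILE NAME: print the IP address that FILE maps NAME to.
# Comments are ignored and only exact name matches count, so "rsw"
# does not match "rsw-docker".
hosts_ip() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # strip trailing comments
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$1"
}

# Assumed usage:
# hosts_ip "$CHROOT/etc/hosts" rsw
```

The function prints nothing if the name is missing from the file.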

8.2 SLURM

Some notes:

/etc/slurm/slurm.conf must be identical everywhere (RSW, edi, nodes).
Two SlurmctldHost entries are needed in /etc/slurm/slurm.conf (one for edi, one for RSW in the container).
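One way to check the first point is to compare checksums of the file across hosts. The sketch below covers edi and the nodes only; the copy inside the RSW container would have to be checked separately. The function name conf_in_sync is hypothetical, and the input format assumes pdsh's "node: output" prefix in front of sha256sum's output.

```shell
# conf_in_sync: read lines of the form "node: <sha256>  <path>" on stdin.
# Succeeds only if at least one line was read and all checksums are equal.
conf_in_sync() {
    awk '{ seen[$2] = 1 }
         END { n = 0; for (h in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Assumed usage on edi:
# { sha256sum /etc/slurm/slurm.conf | sed 's/^/edi: /'
#   pdsh -w c[0-5] sha256sum /etc/slurm/slurm.conf
# } | conf_in_sync && echo "slurm.conf is in sync"
```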

8.2.1 Undrain a node

If a node is in the “drain” state, undrain it with:

scontrol update NodeName=<node> State=DOWN Reason="undraining"
scontrol update NodeName=<node> State=RESUME

To resume all nodes at once:

scontrol update nodename=c[0-5] state=resume
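To see which nodes are affected before resuming them, the node list from sinfo can be filtered by state. A minimal sketch with a hypothetical function name, drained_nodes; it assumes the long state names printed by sinfo's %T format field.

```shell
# drained_nodes: read "node state" lines on stdin and print each node whose
# state begins with "drain" (covers "drained", "draining", and states with
# a trailing "*"). Duplicates from multiple partitions are removed.
drained_nodes() {
    awk '$2 ~ /^drain/ { print $1 }' | sort -u
}

# Assumed usage (-h: no header, -N: one line per node):
# sinfo -h -N -o "%N %T" | drained_nodes
```

sinfo -R additionally lists the reason recorded for each down or drained node.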

8.2.2 Reconfigure Slurm

Run this after changing settings, for example:

scontrol reconfigure

8.3 Docker

8.3.1 Pulling a new image

Run as the user admingeogr, which has AWS pull credentials configured:

cd /home/admingeogr/rsw
# log into AWS ECR repo
aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 222488041355.dkr.ecr.eu-central-1.amazonaws.com
docker-compose pull

8.3.2 Update a container

cd /home/admingeogr/rsw
docker-compose up -d

8.3.3 Clean up old images

docker image prune -af
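The pull, update, and clean-up steps of this section can be chained so that a failure stops the sequence. This is a sketch under assumptions: the names run, update_rsw, and DRY_RUN are hypothetical, and the ECR login is expected to have been done beforehand.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# update_rsw: pull new images, recreate the containers, then prune unused
# images. Each step runs only if the previous one succeeded.
update_rsw() {
    run cd /home/admingeogr/rsw &&
    run docker-compose pull &&
    run docker-compose up -d &&
    run docker image prune -af
}

# Preview the commands without executing them:
# DRY_RUN=1 update_rsw
```

Pruning last means old images are removed only after the updated containers have started.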
