Chestnut Shared HPC Cluster update

The Chestnut Cluster now has fully updated InfiniBand network drivers on all compute nodes, and every node is functional again (a couple had been offline due to hardware issues, which have since been resolved). As a reminder, pass the -C "[ib1|ib2|ib3|ib4]" option to salloc, sbatch, or srun to ensure that all nodes within a job are on the same InfiniBand network switch; otherwise, your job will fall back to the slower Ethernet network between nodes.
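
As an illustration of the constraint option, here is a minimal sketch of a batch script that requests two nodes on a single InfiniBand switch. The job name, node count, and ./my_mpi_program are placeholders rather than anything specific to Chestnut. Inside an #SBATCH line the shell does not interpret the brackets, so quotes are only needed when the option is typed on the command line.

```shell
# Write a two-node batch script that keeps every node on one InfiniBand switch.
cat > ib_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=ib-example
#SBATCH --nodes=2
#SBATCH --partition=standby
# Square brackets: Slurm picks ONE of these features for all nodes in the job.
#SBATCH --constraint=[ib1|ib2|ib3|ib4]

# Hypothetical MPI program; replace with your own executable.
srun ./my_mpi_program
EOF

# Submit it with: sbatch ib_job.sh
```

The same constraint can be given directly on the command line, for example: sbatch -C "[ib1|ib2|ib3|ib4]" -p standby -N 2 ib_job.sh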

We have also enabled the standby partition across the cluster. It currently has no special restrictions. A job submitted to the standby partition can run on anybody’s idle nodes. However, if a node’s owner starts a job on that node while your standby job is running there, your job will be preempted and requeued.
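
Because a standby job can be requeued at any time, it helps if the job script can tell a fresh start from a restart. The sketch below assumes the requeued job reruns its batch script from the top, and it relies on the SLURM_RESTART_COUNT environment variable that Slurm sets for restarted jobs. The checkpoint handling and program names in the comments are hypothetical.

```shell
#!/bin/bash
# Slurm sets SLURM_RESTART_COUNT once a job has been requeued and restarted;
# on the first run it is unset, so treat a missing value as 0.
was_requeued() {
    [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]
}

if was_requeued; then
    echo "Restart number ${SLURM_RESTART_COUNT}: resuming from checkpoint"
    # Hypothetical: ./my_program --resume checkpoint.dat
else
    echo "First run: starting from the beginning"
    # Hypothetical: ./my_program
fi
```

Whether resuming is possible depends on your application writing its own checkpoints; the script only decides which path to take.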

The scheduler no longer has a default partition. You must explicitly choose which partition to run your job in, be it your group’s partition or the standby partition. Use the -p <partition_name> option for this purpose.
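
If you would rather not type -p on every command, one option is to set a personal default through the environment variables that sbatch, salloc, and srun read. This is a sketch, not a site recommendation: "mygroup" is a placeholder for your group's partition name, and a -p given on the command line still takes precedence.

```shell
# Explicit partition on the command line, as described above:
#   sbatch -p standby job.sh
#
# Alternative: a per-user default, e.g. placed in ~/.bashrc.
# "mygroup" is a placeholder; substitute your group's partition name.
MY_PARTITION=mygroup
export SBATCH_PARTITION="$MY_PARTITION"   # read by sbatch
export SALLOC_PARTITION="$MY_PARTITION"   # read by salloc
export SLURM_PARTITION="$MY_PARTITION"    # read by srun
```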

There was a problem with quotas on some user directories. Some users received an error message that the filesystem was full even though they were not using their entire 100GB in their home directory. This issue has been fixed as of this update.

A temporary-file cleaner now runs nightly and removes files older than 31 days from /scratch-local on all nodes, including the login node.
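
To see which of your own files are candidates for removal on the node you are logged in to, something like the following may help. It is a sketch that assumes the cleaner judges age by modification time, which the announcement does not specify; old_scratch_files is a hypothetical helper name.

```shell
# List your regular files under a directory (default: /scratch-local)
# that were last modified 31 or more days ago.
# find's "-mtime +30" matches files whose age in whole days exceeds 30.
old_scratch_files() {
    local dir=${1:-/scratch-local}
    find "$dir" -type f -user "$(id -un)" -mtime +30 2>/dev/null
}

# Only scan if the directory exists on this machine.
if [ -d /scratch-local ]; then
    old_scratch_files /scratch-local
fi
```

Since /scratch-local is local to each node, the check has to be run on every node where you have left files.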