Chestnut Shared HPC Cluster update

All compute nodes in the Chestnut Cluster now have fully updated InfiniBand network drivers, and all nodes are functional once again (a couple had been offline due to hardware issues, which have since been resolved). As a reminder, pass the -C "[ib1|ib2|ib3|ib4]" option to salloc, sbatch, or srun to ensure that all nodes within a job are on the same InfiniBand network switch; otherwise, traffic between the job’s nodes will fall back to the slower Ethernet network.
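As a minimal sketch of the constraint option above, the following Python helper assembles an sbatch command line that keeps a job on one InfiniBand switch. The helper name `build_ib_sbatch_command`, the node count, and the script name `job.sh` are hypothetical; only the -C option and the ib1 through ib4 feature names come from this post.

```python
import shlex

# Feature names taken from the post; one per InfiniBand switch.
IB_FEATURES = ["ib1", "ib2", "ib3", "ib4"]


def build_ib_sbatch_command(script, nodes):
    """Return an sbatch argument list that keeps all nodes on one IB switch.

    `script` and `nodes` are illustrative parameters, not cluster defaults.
    """
    # Square brackets around the alternatives ask Slurm to pick ONE of the
    # features and apply it to every node in the job.
    constraint = "[" + "|".join(IB_FEATURES) + "]"
    # A real submission on this cluster also needs -p <partition_name>,
    # since the scheduler has no default partition.
    return ["sbatch", "-N", str(nodes), "-C", constraint, script]


if __name__ == "__main__":
    args = build_ib_sbatch_command("job.sh", nodes=4)
    # shlex.join quotes the constraint so the shell does not interpret
    # the | and [] characters.
    print(shlex.join(args))
```

Building the command as a list rather than a single string avoids quoting mistakes if the list is later handed to `subprocess.run`.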

We have also enabled the standby partition across the cluster. It currently has no special restrictions. A job submitted to the standby partition can run on anybody’s idle nodes. However, if a node’s owner starts a job on that node while your standby job is running there, your job will be preempted and requeued.
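Because a standby job can be preempted and requeued at any point, it helps if the job can pick up where it left off. The sketch below shows one way to do that with a small checkpoint file. The file name `checkpoint.json`, the step count, and the function names are hypothetical, and the real work is replaced by a placeholder.

```python
import json
import os
import tempfile


def load_step(path):
    """Return the last completed step recorded in `path`, or 0 if none."""
    try:
        with open(path) as fh:
            return json.load(fh)["step"]
    except FileNotFoundError:
        return 0


def save_step(path, step):
    """Record `step` atomically so a preemption cannot leave a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump({"step": step}, fh)
    os.replace(tmp, path)  # atomic rename within the same filesystem


def run_steps(path, total_steps):
    """Run the remaining steps, checkpointing after each one.

    Returns the step the run resumed from (0 on a fresh start).
    """
    start = load_step(path)
    for step in range(start + 1, total_steps + 1):
        # Placeholder for the real unit of work.
        save_step(path, step)
    return start


if __name__ == "__main__":
    resumed_from = run_steps("checkpoint.json", total_steps=10)
    print(f"resumed from step {resumed_from}")
```

Keep the checkpoint on storage that every node can see, since a requeued job may start on a different node than the one it was preempted from.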

The scheduler no longer has a default partition. You must explicitly choose which partition to run your job in, be it your group’s partition or the standby partition. Use the -p <partition_name> option for this purpose.
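Since there is no longer a default partition, a submission wrapper can refuse to run without one. The following is a minimal sketch of that idea. The function name `sbatch_args_with_partition`, the partition name `mygroup`, and the script name `job.sh` are hypothetical; the -p option and the standby partition are from this post.

```python
def sbatch_args_with_partition(script, partition):
    """Build an sbatch argument list that always names a partition."""
    if not partition or not partition.strip():
        # Catch the mistake locally instead of waiting for the scheduler
        # to reject the job.
        raise ValueError("a partition is required, e.g. 'standby'")
    return ["sbatch", "-p", partition.strip(), script]


if __name__ == "__main__":
    print(sbatch_args_with_partition("job.sh", "standby"))
    print(sbatch_args_with_partition("job.sh", "mygroup"))  # hypothetical name
```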

There was a problem with quotas on some user directories. Some users were getting an error message that the filesystem was full even though they had not used their entire 100 GB of home directory space. This issue has been remedied as of the update.

A temporary file cleaner is now running nightly to remove files older than 31 days from inside /scratch-local on all nodes, including the login node.
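To see which of your files the cleaner might remove, you can list anything under a directory that is older than 31 days. This sketch assumes the cleaner judges age by modification time, which this post does not state. The function name `find_old_files` is hypothetical.

```python
import os
import time

MAX_AGE_DAYS = 31  # retention period stated in the post


def find_old_files(root, max_age_days=MAX_AGE_DAYS, now=None):
    """Return sorted paths under `root` not modified in `max_age_days` days.

    Assumes age means modification time (mtime); the real cleaner may use
    a different timestamp.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    old.append(path)
            except FileNotFoundError:
                # File vanished between listing and stat; skip it.
                continue
    return sorted(old)


if __name__ == "__main__":
    # Substitute your own subdirectory to narrow the search.
    for path in find_old_files("/scratch-local"):
        print(path)
```

The script only reports files and never deletes anything, so it is safe to run on the login node or inside a job.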

Chestnut Shared HPC Cluster update prep

We ran into some problems while prepping for the update: the InfiniBand drivers were not recompiling for the latest secure kernel. The previous InfiniBand packages broke on the recompilation step, so we had to download the latest version of the packages and recompile them for the latest kernel. Now testing compile and install steps…

[Edited]

Successfully compiled with the newer packages. We will be updating the IB drivers during Tuesday’s outage window.

Upcoming Chestnut Shared HPC Cluster update

Please check your e-mail inbox for full details on the upcoming Chestnut Shared HPC Cluster update, scheduled for Tuesday, 2/20/2018 at 8 AM.

Highlights:

1. Storage updates (**SEE DATA REMOVAL WARNING REGARDING /scratch-local/ IN E-MAIL**)
2. You must now specify a partition for each job
3. New “standby” partition