JL Cluster decommissioning roadmap

The JL Cluster compute nodes and job scheduler will be shut down on June 1, 2018. At that time, you will still be able to log in to the head node.

The JL Cluster shared storage will become read-only on June 1, 2018. At that time we will make a final backup of the data, to be owned by Dr. Lukes. That backup will not preserve special files (symbolic links are the most common user-owned special files).
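If you would like to see which symbolic links you own before that backup is made, one rough way to list them is with find; the path below is only a placeholder for wherever your data actually lives on the JL shared storage:

    find /path/to/your/jl/data -type l -ls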

The JL Cluster will be shut down on June 30, 2018. Prior to that date, you may copy out any important data that you wish to keep for future work.
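As a sketch of one way to copy data off to another system, rsync handles large transfers and can resume interrupted ones; the hostname and paths below are placeholders, not the actual JL Cluster names:

    rsync -avP jl-headnode:/path/to/your/data/ /destination/on/other/system/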


NLP Grid update

The NLP Grid had been experiencing slow interactive prompts, which affected the login node and all interactive sessions. This has been remedied.

Walnut Shared HPC Cluster decommissioning roadmap

The Walnut cluster was first available for use approximately five years ago, in May 2013.  This means that there is only about one year left in its planned six-year lifetime.  We plan to decommission the Walnut cluster on May 31, 2019.

Please start thinking about migrating your work and your data to other systems.  If you would like advice and/or help with this, please contact us at research@seas.upenn.edu.  We will be sending further announcements out to all users as we get closer to the end date.

Chestnut Shared HPC Cluster update

The Chestnut Cluster now has fully updated Infiniband network drivers across all compute nodes, all of which are back in service (a couple had been offline due to hardware issues, which have since been resolved). As a reminder, use the -C "[ib1|ib2|ib3|ib4]" option with salloc, sbatch, or srun to ensure that all nodes within a job are on the same Infiniband network switch; otherwise, your job will fall back to the slower Ethernet network between nodes.
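For example, a batch submission asking the scheduler to place all of a job's nodes on a single Infiniband switch might look like the following; the partition name and job script are placeholders:

    sbatch -p <your_group_partition> -C "[ib1|ib2|ib3|ib4]" --nodes=4 my_job.sh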

We have also enabled the standby partition across the cluster. Currently it does not have any special restrictions. If you submit a job to the standby partition, it has the opportunity to run on anybody’s idle nodes. However, if the node owner starts a job at any point while your standby job is running, your job will get kicked off and requeued.
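A standby submission could look like the sketch below; my_job.sh is a placeholder, and the --requeue flag simply marks the job as eligible for requeueing if it is preempted (depending on the scheduler configuration this may already be the default):

    sbatch -p standby --requeue my_job.sh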

The scheduler no longer has a default partition. You must explicitly choose which partition to run your job in, be it your group’s partition or the standby partition. Use the -p <partition_name> option for this purpose.
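To list the partitions available to you and then submit to one explicitly, something like the following should work; the format string simply selects partition name, availability, time limit, and node count, and <partition_name> is whichever partition you choose:

    sinfo -o "%P %a %l %D"
    sbatch -p <partition_name> my_job.sh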

There was a problem with some user directories’ quotas. Some users found that, although they were not using their entire 100GB home directory quota, they were still getting error messages that the filesystem was full. This issue has been remedied as of the update.
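If you want to sanity-check your own usage against the 100GB limit, one simple (if slow) option is to total up your home directory yourself:

    du -sh "$HOME"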

A temporary file cleaner is now running nightly to remove files older than 31 days from inside /scratch-local on all nodes, including the login node.
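To preview which of your files on a given node would be candidates for removal, assuming they are owned by your username, you could run something along these lines:

    find /scratch-local -user "$USER" -type f -mtime +31 -ls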

Chestnut Shared HPC Cluster update prep

Ran into some problems while prepping for the update: the Infiniband drivers were not recompiling against the latest secure kernel. We had to download the latest version of the Infiniband packages and recompile them for the latest kernel, as the previous Infiniband packages broke at the recompilation step. Now testing the compile and install steps…

[Edited]

Successfully compiled with the newer packages; we will be updating the IB drivers during Tuesday’s outage window.

Spring 2018 CLUNCH

Today Dan Widyono spoke at the Spring 2018 CLUNCH NLP Grid presentation / workshop. Slides will be provided to Reno to distribute to those who missed it or who would like a copy for reference.

Upcoming Chestnut Shared HPC Cluster update

Please check your e-mail inbox for full details on the upcoming Chestnut Shared HPC Cluster update, on Tuesday 2/20/2018 at 8AM.

Highlights:

1. Storage updates (**SEE DATA REMOVAL WARNING REGARDING /scratch-local/ IN E-MAIL**)
2. You must now specify a partition for each job
3. New “standby” partition