Virtual extension of GoeGrid into Emmy

Research conducted by the University of Göttingen.

The work focused on virtually extending the GoeGrid batch system into the Emmy HPC system using containers, turning HPC nodes into virtual worker nodes with their own job scheduling.

GoeGrid and Emmy overview

  • Emmy is an HLRN-IV system at the University of Göttingen. Its hardware includes Intel Xeon Platinum 9242 CPUs @ 2.3 GHz with 96 physical cores per node. On Emmy, hyper-threading is enabled and turbo boost is disabled.
  • GoeGrid is a joint grid resource center at the University of Göttingen. Its hardware includes CPU clusters with Intel Xeon and AMD EPYC processors. On GoeGrid, hyper-threading is enabled, and turbo boost is enabled on the Intel nodes.
Key achievements in setup

The following milestones have been reached and problems solved:

  • CVMFS client installation. Because Filesystem in Userspace (FUSE) was not available on Emmy, the unprivileged CVMFS client (cvmfsexec) was installed; a usage sketch follows this list.
  • User namespaces. Unprivileged user namespaces were enabled; a quick check is sketched after this list.
  • High-speed connection establishment. A 4×100 Gbit/s network connection between GoeGrid and Emmy was established. 
  • Proxy. Because the nodes have no network connection to the outside world, a Squid proxy is used to download software and other resources from CVMFS; the proxy setting appears in the cvmfsexec sketch below.
  • Whole node scheduling. HTCondor has been deployed to turn Emmy worker nodes into virtual worker nodes with dynamically partitionable slots; see the configuration sketch after this list.
  • Shared filesystem. A shared Lustre filesystem (SSD and HDD) is used for launching drones and caching CVMFS; it allows switching between SSDs and HDDs.
  • Job runtime limits. Job runtime is currently limited to 12 hours and may be extended to 48 hours in the future; a submit-side sketch of such a limit follows this list.
  • Monitoring efficiency. Drone efficiency, resource usage, and core utilization over the drone lifetime are tracked; a small efficiency calculation is sketched after this list.
  • Singularity/Apptainer container launch. A Singularity/Apptainer container launched on the local HPC (Emmy) cluster contains the job scheduling (HTCondor) system. The HTCondor daemons inside it (the drones) turn the Emmy node into a virtual worker node of GoeGrid; a launch sketch follows this list.
  • ATLAS jobs. ATLAS jobs are submitted via the GoeGrid ARC-CEs and the HTCondor batch system to the virtual nodes on Emmy; regular ATLAS jobs now run successfully.
  • Future COBalD/TARDIS usage. It is planned to use COBalD/TARDIS for managing virtual node provisioning based on resource needs and availability.
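
The sketches below illustrate some of the points above; all host names, file paths, and image names in them are placeholders rather than the production values. First, the CVMFS and proxy items: a drone could mount the ATLAS repository with cvmfsexec behind the Squid proxy roughly like this (a minimal sketch, assuming a cvmfsexec checkout with a prepared dist directory):

    import pathlib
    import subprocess

    # Hypothetical locations; the real deployment paths are not given above.
    CVMFSEXEC = pathlib.Path.home() / "cvmfsexec" / "cvmfsexec"
    DIST_CONF = pathlib.Path.home() / "cvmfsexec" / "dist" / "etc" / "cvmfs" / "default.local"

    # Point the CVMFS client at the local Squid proxy, since the worker
    # nodes cannot reach the outside world directly.
    DIST_CONF.parent.mkdir(parents=True, exist_ok=True)
    DIST_CONF.write_text('CVMFS_HTTP_PROXY="http://squid.goegrid.example:3128"\n')

    # Mount the ATLAS repository without privileges and run a command
    # inside the resulting environment.
    subprocess.run(
        [str(CVMFSEXEC), "atlas.cern.ch", "--", "ls", "/cvmfs/atlas.cern.ch"],
        check=True,
    )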
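
For the user-namespace item, a quick heuristic check that unprivileged user namespaces are enabled (assuming a Linux kernel exposing the usual sysctl files) could be:

    import pathlib

    def userns_enabled() -> bool:
        # A zero user.max_user_namespaces means namespaces are disabled.
        limit = pathlib.Path("/proc/sys/user/max_user_namespaces")
        if limit.exists() and int(limit.read_text()) == 0:
            return False
        # Debian-family kernels gate unprivileged use behind an extra knob.
        clone = pathlib.Path("/proc/sys/kernel/unprivileged_userns_clone")
        if clone.exists() and int(clone.read_text()) == 0:
            return False
        return limit.exists() or clone.exists()

    print("unprivileged user namespaces usable:", userns_enabled())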
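
The whole-node scheduling item relies on HTCondor's partitionable slots. An illustrative startd fragment (not the actual GoeGrid configuration) advertises the whole node as one partitionable slot that is carved up per job:

    import pathlib

    # Illustrative HTCondor startd fragment: a single partitionable slot
    # owning all CPUs, memory, and disk of the node.
    FRAGMENT = (
        "NUM_SLOTS = 1\n"
        "NUM_SLOTS_TYPE_1 = 1\n"
        "SLOT_TYPE_1 = cpus=100%, mem=100%, disk=100%\n"
        "SLOT_TYPE_1_PARTITIONABLE = True\n"
    )

    # Hypothetical drop-in path inside the drone container.
    pathlib.Path("/etc/condor/config.d/50-partitionable.conf").write_text(FRAGMENT)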
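
A 12-hour runtime limit can be enforced in several places; the text does not say which mechanism is used here, but one submit-side variant using the HTCondor Python bindings would be:

    import htcondor

    # Remove the job once it has been running for more than 12 hours.
    # JobStartDate is set by HTCondor when the job first starts running.
    sub = htcondor.Submit({
        "executable": "run_payload.sh",   # hypothetical payload wrapper
        "request_cpus": "8",
        "periodic_remove": "(time() - JobStartDate) > 12 * 3600",
    })

    htcondor.Schedd().submit(sub)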
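
For the monitoring item, per-job CPU efficiency can be derived from standard HTCondor job-ad attributes; a minimal sketch:

    def cpu_efficiency(ad: dict) -> float:
        # CPU seconds actually consumed by the job (user + system time).
        cpu = ad.get("RemoteUserCpu", 0.0) + ad.get("RemoteSysCpu", 0.0)
        # Wall-clock seconds scaled by the number of allocated cores.
        wall = ad.get("RemoteWallClockTime", 0.0) * max(ad.get("RequestCpus", 1), 1)
        return cpu / wall if wall > 0 else 0.0

    # Example: an 8-core job that used 6.5 core-days of CPU in one wall-day.
    print(cpu_efficiency({
        "RemoteUserCpu": 6.5 * 86400, "RemoteSysCpu": 0.0,
        "RemoteWallClockTime": 86400, "RequestCpus": 8,
    }))  # ~0.81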
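
Finally, the container launch: starting the HTCondor daemons inside an Apptainer container on an Emmy node could look roughly as follows (image path and bind mount are placeholders), after which the startd inside registers the node with the GoeGrid pool:

    import subprocess

    IMAGE = "/lustre-ssd/drones/goegrid-wn.sif"   # hypothetical image path
    SHARED = "/lustre-ssd"                        # hypothetical Lustre mount

    # Run condor_master in the foreground inside the container; it starts
    # the startd, which joins the GoeGrid pool as a virtual worker node.
    subprocess.run(
        ["apptainer", "exec", "--bind", f"{SHARED}:{SHARED}",
         IMAGE, "condor_master", "-f"],
        check=True,
    )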

All the achievements listed above can be summarized in two major points:

  • The NHR cluster (Emmy) has been successfully integrated with the GoeGrid Tier-2 cluster.
  • ATLAS jobs now run regularly on Emmy.

Future work

For future development, the following goals have been set:

  • Use of COBalD/TARDIS to automate drone launches and resource connection
  • Automated load balancing on Emmy (temporarily for certain job types)
  • Finish generic R&D of NHR integration into GoeGrid/WLCG
  • Smooth transition from the university-based Tier-2 to NHR computing