Enhancing resource provisioning and sustainable computing

Research conducted by the Karlsruhe Institute of Technology (KIT).

 

By deploying the setups and tools described below, we are enhancing the efficiency, sustainability, and accessibility of computing resources across various research communities.

Simplifying resource provisioning

At KIT, we have simplified the provisioning and utilization of third-party compute resources for various communities through:

  • Dynamic integration. Third-party resources are integrated seamlessly and on demand via COBalD/TARDIS.
  • Unified entry points. Consistent access to diverse resources (HPC clusters, clouds, etc.) is provided; see the sketch after this list.
  • Production-scale operation. Scale tests with HoreKa (the KIT HPC cluster) were successful.
  • Production deployment. Deployment is coordinated by KIT/GridKa across HEP institutes and HPC resources.
  • Compute4PUNCH infrastructure. A key component within PUNCH4NFDI.
  • Additional deployments. A similar setup has been deployed at CLAIX HPC (RWTH Aachen) and is currently being deployed at Emmy (University of Göttingen).
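
To make the unified entry point concrete, the following minimal Python sketch uses the HTCondor Python bindings to list the worker nodes of an overlay HTCondor pool, where dynamically integrated HPC and cloud resources appear like ordinary worker nodes. The collector address and the selected attributes are illustrative assumptions, not the actual KIT configuration.

```python
import htcondor

# Address of the overlay pool's collector (hypothetical placeholder).
collector = htcondor.Collector("collector.example.org")

# Query all worker-node (startd) ads currently known to the pool.
# Dynamically integrated HPC and cloud resources appear here like any
# other HTCondor worker node.
ads = collector.query(
    htcondor.AdTypes.Startd,
    projection=["Machine", "State", "TotalCpus", "TotalMemory"],
)

for ad in ads:
    print(ad.get("Machine"), ad.get("State"),
          ad.get("TotalCpus"), ad.get("TotalMemory"))
```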

Lancium is no longer used

We enabled access to sustainable computing resources via Lancium, a platform developed by a US company. It balanced the power grid by operating compute facilities near renewable energy sources, achieving CO2-neutral operation. Lancium could be integrated with COBalD/TARDIS, was used for ATLAS/CMS MC generation, and proved successful in the "Proof of Concept" project. However, Lancium exited the PaaS business in April 2023.

Current COBalD/TARDIS ecosystem

The current COBalD/TARDIS ecosystem includes:

  • Container stacks (wlcg-wn, htcondor-wn)
  • COBalD
  • TARDIS
  • HTCondor

When integrating new resources, we prioritize leveraging existing tools and continuously improve them to meet new challenges; new tools are introduced only when the existing ones prove insufficient.

AUDITOR at KIT

The AUDITOR accounting ecosystem setup at KIT includes:

  • AUDITOR instance with a PostgreSQL database
  • HTCondor collector
  • APEL plugin running as a systemd service per CE; each CE reports individually to APEL.

This setup is currently being tested with opportunistic resources from the University of Bonn.
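
Since the accounting components run as systemd services, a quick way to check a deployment is to poll systemd for each unit. The sketch below does exactly that; the unit names are hypothetical placeholders and may differ from the actual KIT setup.

```python
import subprocess

# Hypothetical unit names; adjust to the actual deployment.
UNITS = [
    "auditor.service",
    "auditor-htcondor-collector.service",
    "auditor-apel-plugin@ce1.example.org.service",
    "auditor-apel-plugin@ce2.example.org.service",
]

def unit_active(unit: str) -> bool:
    """Return True if systemd reports the unit as active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit],
                            check=False)
    return result.returncode == 0

for unit in UNITS:
    state = "active" if unit_active(unit) else "NOT active"
    print(f"{unit}: {state}")
```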

HTCondor Collector

KIT contributed the HTCondor collector to the AUDITOR system. The collector features:

  • Periodic queries of the HTCondor history, with job records sent to AUDITOR.
  • A single .yaml file is needed for setup.
  • Comprehensive documentation.

Ongoing improvements include the use of the HTCondor Python bindings and continuous support.
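
To illustrate what querying via the HTCondor Python bindings can look like, the sketch below periodically reads recently completed jobs from the schedd history. It is a simplified stand-in, not the collector's actual implementation; the polling interval and attribute selection are assumptions.

```python
import time
import htcondor

POLL_INTERVAL = 300  # seconds, illustrative value
PROJECTION = ["GlobalJobId", "Owner", "RemoteWallClockTime",
              "RequestCpus", "RequestMemory", "ExitCode"]

schedd = htcondor.Schedd()
last_poll = int(time.time()) - POLL_INTERVAL

while True:
    # Only fetch jobs that completed since the previous poll.
    constraint = f"CompletionDate >= {last_poll}"
    last_poll = int(time.time())

    for ad in schedd.history(constraint, projection=PROJECTION, match=-1):
        # A real collector would convert each record and send it to the
        # AUDITOR instance; here we only print a few attributes.
        print(ad.get("GlobalJobId"), ad.get("Owner"),
              ad.get("RemoteWallClockTime"))

    time.sleep(POLL_INTERVAL)
```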

Troubleshooting

FIDIUM tools are designed for Grid-enabled communities, so if your collaboration is not Grid-enabled, the setup will probably not work for your experiment out of the box. In that case, you can gain access via a traditional login node or JupyterHub, or use our technology to Grid-enable your experiment.

Future work

We plan to enhance the economical and energy-efficient utilization of GPU resources by improving the scheduling of GPU jobs. This can be achieved by running multiple jobs on a single GPU using methods such as NVIDIA's MIG (Multi-Instance GPU) or MPS (Multi-Process Service), by introducing checkpoints for ML GPU workflows, and by optimizing the CPU-core-to-GPU ratio.
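
As one possible form of checkpointing for ML GPU workflows, the PyTorch sketch below saves and restores training state so that a job interrupted on an opportunistic GPU can resume instead of restarting from scratch. The file name and the save/resume granularity are illustrative choices, not a prescribed implementation.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"  # illustrative path

def save_checkpoint(model, optimizer, epoch):
    """Persist everything needed to resume training after preemption."""
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer, device):
    """Resume from the last checkpoint if one exists; return next epoch."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location=device)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

# Typical use inside a training loop (model/optimizer defined elsewhere):
# start_epoch = load_checkpoint(model, optimizer, device)
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(...)
#     save_checkpoint(model, optimizer, epoch)
```

Saving the optimizer state alongside the model weights is what allows training to continue seamlessly after the job is rescheduled.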

We also plan to enable opportunistic/pledged utilization of GPU resources in HPC clusters, for example, HoreKa@KIT.
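
Once HPC GPU resources are integrated into the overlay pool, a user job could request them through the familiar HTCondor interface. The sketch below, assuming recent HTCondor Python bindings, requests one GPU together with an explicit number of CPU cores to express a CPU-core-to-GPU ratio; all resource values and file names are illustrative.

```python
import htcondor

# Illustrative job description: one GPU with four CPU cores (a 4:1 ratio).
job = htcondor.Submit({
    "executable": "train.sh",      # hypothetical training wrapper script
    "request_gpus": "1",
    "request_cpus": "4",
    "request_memory": "16GB",
    "output": "train.out",
    "error": "train.err",
    "log": "train.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(job)
print("Submitted job cluster", result.cluster())
```

Keeping the ratio explicit in the job description makes it easy to tune for a given workload and GPU partitioning scheme.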

This work can be divided into two milestones:

  • Dynamic and transparent integration of HPC GPU resources
  • Documentation and finalization of developments.