Enhancing resource integration and efficiency with COBalD/TARDIS

Research conducted by the University of Wuppertal.

 

Our work in TA 3 was focused on the COBalD/TARDIS setup in Wuppertal. We have defined the following goals:

  • Make (local) Slurm resources available to GridKa
  • Simplify the setup process
  • Provide a test environment
  • Integrate accounting

In summary, our COBalD/TARDIS setup now runs ATLAS jobs stably and efficiently. Integration with the AUDITOR system has been completed successfully, while work on APEL accounting is still in progress. We have implemented PanDA accounting through a local override in the pilot script wrapper, and preliminary comparisons between AUDITOR and PanDA show strong agreement.

We now describe these achievements in detail.

COBalD/TARDIS setup

In the COBalD/TARDIS setup, we used:

  • Docker Compose. It hosts multiple COBalD/TARDIS instances for different drone types and, optionally, a self-hosted HTCondor central manager (CM), CCB, and Sched for testing.
  • Plugins. The setup accepts plugins, handled by the build.sh/start.sh scripts. We provide plugins for monitoring COBalD/TARDIS instances and drones, for AUDITOR with an HTCondor collector, and a Drone Watchdog (work in progress).

Drones

Apptainer containers serve as drones in our setup, with container images generated during the build.sh process. Since pilot jobs also initiate containers, nested containers are required. We’ve activated user namespaces while disabling network namespaces. Additionally, we monitor and adjust system limits as needed, including thread and file descriptor counts, to ensure smooth operation.
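The limit checks mentioned above can be done from Python's standard library. A minimal sketch, assuming we only want to guarantee a minimum soft file-descriptor limit (the function name and threshold are illustrative, not part of our actual tooling):

```python
import resource

def ensure_nofile_limit(minimum: int) -> int:
    """Raise the soft file-descriptor limit to at least `minimum`.

    Returns the resulting soft limit. The hard limit caps what an
    unprivileged process may request, so we never exceed it.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < minimum:
        # Stay within the hard limit unless it is unlimited.
        new_soft = minimum if hard == resource.RLIM_INFINITY else min(minimum, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        soft = new_soft
    return soft
```

Thread counts can be inspected similarly via `/proc` or `resource.RLIMIT_NPROC`; the same raise-if-below pattern applies.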

Plugins

Plugins used for monitoring:

  • Telegraf-InfluxDB-Grafana stack
  • Data from COBalD/TARDIS logging is sent via a UDP handler
  • A drone-local script monitors drone usage such as CPU, memory, and thread count
  • The Drone Watchdog sends job information
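The UDP logging path can be reproduced with the standard library's `DatagramHandler`. A self-contained sketch, where the local receiver socket and the logger name are stand-ins for the real monitoring endpoint:

```python
import logging
import logging.handlers
import pickle
import socket

# Stand-in receiver for the monitoring stack's UDP input (in production
# this would be e.g. a Telegraf socket listener).
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(5)
port = receiver.getsockname()[1]

# Attach a UDP handler to a logger (logger name is illustrative).
logger = logging.getLogger("cobald.monitoring.demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.DatagramHandler("127.0.0.1", port))

logger.info("drone started: %s", "drone-001")

# Each datagram is a pickled log-record dict prefixed with a 4-byte length.
payload, _ = receiver.recvfrom(65536)
record = pickle.loads(payload[4:])
print(record["msg"])  # prints "drone started: drone-001"
```

On the receiving side, Telegraf (or any collector) parses the incoming datagrams and forwards the metrics to InfluxDB for display in Grafana.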

AUDITOR plugins:

  • Plugin for AUDITOR, Postgres, and the HTCondor collector

Drone Watchdog

The Drone Watchdog is a Python program running in each drone. Its tasks are:

  • Monitoring. It watches inotify events on the Startd history and marks the drone unhealthy when jobs fail.
  • Notifying. It sends job information to the monitoring stack.
  • Cleanup. It drains aged drones to prevent zombie drones.
  • CVMFS availability check. It monitors CVMFS availability and sets the drone's health state accordingly.
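The monitoring task can be illustrated with a simplified sketch. This is not the production watchdog: the real one reacts to inotify events, while this version polls and re-reads only the newly appended part of the history file; class and attribute names are illustrative.

```python
import os

class DroneWatchdog:
    """Simplified sketch: scan newly appended Startd history entries and
    mark the drone unhealthy once a failed job shows up."""

    def __init__(self, history_path):
        self.history_path = history_path
        self.healthy = True
        self._offset = 0  # read position in the history file

    def poll(self):
        """Check history entries appended since the last call."""
        if not os.path.exists(self.history_path):
            return
        with open(self.history_path) as history:
            history.seek(self._offset)
            for line in history:
                # In HTCondor job ads, a non-zero ExitCode marks a failed job.
                if line.startswith("ExitCode") and not line.rstrip().endswith("= 0"):
                    self.healthy = False
            self._offset = history.tell()
```

The `healthy` flag would then feed into the drone's HTCondor health expression so that an unhealthy drone stops accepting new jobs.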

Testing environment

The testing environment setup features:

  • Vagrant and Ansible are used to build a virtual cluster
  • Slurm, BeeGFS, and CernVM-FS are installed automatically
  • the COBalD/TARDIS setup is pulled in as a submodule

Adapting to external resource limitations

ATLAS jobs may fail when bandwidth is limited, so we needed a way to manage new drones that rely on external resources. To address this, we added a new decorator to COBalD. It listens for "demand hints" on a specified TCP port and adjusts the demand accordingly, helping to prevent resource bottlenecks and improve job success rates.
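The idea behind the decorator can be sketched as follows. This is not the actual COBalD API (a real decorator would implement COBalD's pool interface); the names are illustrative, and only the capping logic and the TCP listener are shown.

```python
import socketserver
import threading

class DemandHintDecorator:
    """Sketch: wrap a pool-like object and cap its demand by hints
    received as plain numbers over a TCP port."""

    def __init__(self, pool, host="127.0.0.1", port=0):
        self.pool = pool
        self.hint = None  # latest hint; None means "no cap"
        decorator = self

        class HintHandler(socketserver.StreamRequestHandler):
            def handle(self):
                line = self.rfile.readline().strip()
                try:
                    decorator.hint = float(line)
                except ValueError:
                    pass  # ignore malformed hints

        self.server = socketserver.ThreadingTCPServer((host, port), HintHandler)
        threading.Thread(target=self.server.serve_forever, daemon=True).start()

    @property
    def demand(self):
        """The wrapped pool's demand, capped by the most recent hint."""
        if self.hint is None:
            return self.pool.demand
        return min(self.pool.demand, self.hint)
```

An external bandwidth monitor can then lower the cap by sending a number to the port, throttling drone creation while the bottleneck persists.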

Future work

Our plans for future development include:

  • Consolidate the existing setup so that it can smoothly be used by other parties
  • Create a “How to include a site” guide
  • Contribute to and test features for AUDITOR and plugins