Enhancing resource integration and efficiency with COBalD/TARDIS
Research conducted by the University of Wuppertal.
Our work in TA 3 focused on the COBalD/TARDIS setup in Wuppertal. We defined the following goals:
- Make (local) Slurm resources available to GridKa
- Simplify the setup process
- Provide a test environment
- Integrate accounting
In summary, our COBalD/TARDIS setup now runs ATLAS jobs stably and efficiently. Integration with the AUDITOR system is complete, while work on APEL accounting is still in progress. We have implemented PanDA accounting through a local override in the pilot script wrapper, and preliminary comparisons between AUDITOR and PanDA show strong agreement.
The following sections describe these achievements in detail.
COBalD/TARDIS setup
In the COBalD/TARDIS setup, we used:
- Docker Compose. It runs multiple COBalD/TARDIS instances for different drone types and, optionally, self-hosted HTCondor CM, CCB, and schedd services for testing.
- Plugins. The setup accepts plugins, which are handled by the build.sh and start.sh scripts. Plugins exist for monitoring COBalD/TARDIS instances and drones, for AUDITOR with the HTCondor collector, and for the Drone Watchdog (work in progress).
Drones
Apptainer containers serve as drones in our setup, with container images generated during the build.sh process. Since pilot jobs also initiate containers, nested containers are required. We’ve activated user namespaces while disabling network namespaces. Additionally, we monitor and adjust system limits as needed, including thread and file descriptor counts, to ensure smooth operation.
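The limit adjustment mentioned above can be illustrated with a minimal sketch based on Python's standard resource module. It is not the script used in our setup; the function names and the example value are hypothetical. On Linux, RLIMIT_NOFILE bounds open file descriptors, and RLIMIT_NPROC bounds the number of processes (threads included) of the user.

```python
import resource


def target_soft_limit(soft, hard, wanted):
    """Return the soft limit to request: at least `wanted`, never above `hard`."""
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited:
        return soft  # already unlimited, nothing to raise
    target = max(soft, wanted)
    if hard != unlimited:
        target = min(target, hard)  # an unprivileged process cannot exceed the hard limit
    return target


def raise_soft_limit(res, wanted):
    """Raise the soft limit of `res` towards `wanted` and return the resulting value."""
    soft, hard = resource.getrlimit(res)
    target = target_soft_limit(soft, hard, wanted)
    if target != soft:
        resource.setrlimit(res, (target, hard))
    return target
```

A drone start script could call, for example, raise_soft_limit(resource.RLIMIT_NOFILE, 65536) before launching the pilot; the value 65536 is only a placeholder.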
Plugins
Plugins used for monitoring:
- Telegraf-Influx-Grafana Stack
- COBalD/TARDIS logging data, sent via a UDP handler
- A drone-local script that monitors drone resource usage such as CPU, memory, and thread count
- Job information sent by the Drone Watchdog
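As an illustration of how a drone-local script could deliver such metrics, the sketch below formats one data point in InfluxDB line protocol and sends it as a UDP datagram. It assumes that Telegraf is configured with a UDP input that accepts line protocol; the measurement name, tag, host, and port are placeholders, not values from our setup.

```python
import socket
import time


def format_line(measurement, tags, fields, timestamp_ns=None):
    """Format one point as InfluxDB line protocol.

    Simplification: numeric fields only, and names are not escaped.
    """
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_items = []
    for key, value in fields.items():
        if isinstance(value, int):
            field_items.append(f"{key}={value}i")  # integers carry an 'i' suffix
        else:
            field_items.append(f"{key}={float(value)!r}")
    return f"{measurement}{tag_part} {','.join(field_items)} {timestamp_ns}"


def send_udp(line, host, port):
    """Send one line as a single UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

A call such as send_udp(format_line("drone", {"host": "node1"}, {"cpu": 0.5, "threads": 12}), "127.0.0.1", 8094) would emit one point. UDP keeps the drone independent of the monitoring stack: if the receiver is down, the datagram is simply lost.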
AUDITOR plugins:
- Plugin for AUDITOR, Postgres, and the HTCondor collector
Drone Watchdog
Drone Watchdog is a Python program running in each drone. It has the following tasks:
- Monitoring. It watches inotify events on the startd history and reacts to failed jobs by marking the drone unhealthy.
- Notifying. It sends job information to monitoring.
- Cleanup. It drains aged drones to prevent zombies.
- CVMFS availability check. It monitors CVMFS availability and sets drone health accordingly.
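The cleanup and CVMFS checks listed above can be sketched as follows. This is a simplified illustration, not the Watchdog code itself: the function names, the state strings, and the rule that an empty or unreadable repository directory counts as unavailable are assumptions. The inotify handling and the way the health state is reported are omitted.

```python
import os
import time


def cvmfs_available(repo_path):
    """Treat the repository as available if its directory can be listed and is not empty."""
    try:
        return len(os.listdir(repo_path)) > 0
    except OSError:
        return False  # missing mount, I/O error, permission problem


def is_aged(started_at, max_age_s, now=None):
    """Return True once the drone has been running longer than max_age_s seconds."""
    now = time.time() if now is None else now
    return now - started_at > max_age_s


def drone_state(repo_path, started_at, max_age_s, now=None):
    """Combine both checks into one state (state names are hypothetical)."""
    if not cvmfs_available(repo_path):
        return "unhealthy"
    if is_aged(started_at, max_age_s, now):
        return "drain"
    return "healthy"
```

A periodic loop in the drone could evaluate drone_state with a repository path such as /cvmfs/atlas.cern.ch and act on the result.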
Testing environment
The test environment:
- uses Vagrant and Ansible to build a virtual cluster
- installs Slurm, BeeGFS, and CernVM-FS automatically
- pulls in the COBalD/TARDIS setup as a submodule
Adapting to external resource limitations
ATLAS jobs may fail when bandwidth is limited, so we needed a solution to manage new drones that rely on external resources.
To this end, we have added a new decorator to COBalD. It listens for “demand hints” on a specified TCP port and adjusts the demand accordingly, helping to prevent resource bottlenecks and improve job success rates.
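The idea can be sketched without the COBalD API. The example below is a standalone illustration using asyncio, not the decorator itself. It assumes that a hint is a single non-negative number sent as one text line and that it acts as an upper bound on the demand; the class name, the wire format, and the default port are all hypothetical.

```python
import asyncio


class DemandHint:
    """Holds the most recent hint; None means no hint has been received yet."""

    def __init__(self):
        self.limit = None

    def update(self, text):
        """Parse one hint line; return True if it was accepted."""
        try:
            value = float(text.strip())
        except ValueError:
            return False  # malformed input is ignored
        if not value >= 0:  # rejects negative values and NaN
            return False
        self.limit = value
        return True

    def apply(self, demand):
        """Cap the requested demand at the current hint, if there is one."""
        if self.limit is None:
            return demand
        return min(demand, self.limit)

    async def handle(self, reader, writer):
        """Handle one TCP connection: read a single line, then close."""
        try:
            line = await reader.readline()
            self.update(line.decode("utf-8", errors="replace"))
        finally:
            writer.close()
            await writer.wait_closed()


async def serve(hint, host="127.0.0.1", port=9999):
    """Listen for hints until cancelled (host and port are placeholders)."""
    server = await asyncio.start_server(hint.handle, host, port)
    async with server:
        await server.serve_forever()
```

With serve running, an external monitor that detects limited bandwidth could send a line such as "4" to the port, after which apply(10) returns 4 while smaller demands pass through unchanged.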
Future work
Our plans for future development include:
- Consolidate the existing setup so that other parties can use it easily
- Create a “How to include a site” guide
- Contribute to and test features for AUDITOR and plugins