PHYSnet cluster integration
Research conducted by the University of Hamburg.
In research area III, devoted to adaptation, testing, and optimization, we have set the following goals:
- deploy tools developed within FIDIUM to selected computing centers
- integrate them into the production and analysis environments of HEP experiments
- optimize them for the requirements of typical analysis workflows
PHYSnet cluster overview
The PHYSnet cluster provides computing resources shared by all institutes of the physics faculty. It comprises:
- heterogeneous hardware organized into multiple pools/queues for diverse applications: idefix.q, infinix.q, obelix.q, epyx.q, graphix.q
- partitions reserved for exclusive use by individual project groups, providing high flexibility for tailoring resources to individual or group use cases.
To use these resources for HEP workloads, they must be adapted using containerization technologies and integrated transparently into HEP-specific infrastructure.
Current setup vs ideal setup
The ideal setup we envisage is the transparent integration of compute resources from third-party sites into a single “overlay batch system”.
The current setup is a working small-scale deployment at PHYSnet used for testing. It includes:
- a small dedicated HTCondor instance, with the schedd running on a general-purpose “compile node” that also acts as the central manager
- drones submitted to the local SGE batch system as long-running jobs, with a startd running inside each drone and connecting to the other HTCondor daemons
- the CernVM File System (CVMFS) mounted in user space using cvmfsexec
- all components running without elevated privileges.
In the current setup, containers are handled as follows:
- container sources are unpacked images taken from /cvmfs/unpacked.cern.ch; the drones use the htcondor-wn image developed by KIT, while job containers use the standard CMS CentOS 7 image cc7-cms
- htcondor-wn provides the flexibility to reconfigure drones dynamically; ansible and condor-git-config are used to reconfigure HTCondor without restarting the container (see the drone launch sketch after this list).
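For illustration, the sketch below shows roughly what a drone payload submitted to SGE could do: mount the CVMFS repositories in user space with cvmfsexec and then start an unprivileged condor_master (which spawns the startd) inside the unpacked htcondor-wn container. The repository list, image path, and apptainer invocation are assumptions for illustration, not the exact PHYSnet configuration.

```python
#!/usr/bin/env python3
"""Illustrative drone payload for a long-running SGE job (a sketch, not the
actual PHYSnet scripts): mount CVMFS in user space with cvmfsexec, then start
an unprivileged condor_master inside the unpacked htcondor-wn container."""
import subprocess

# Assumption: these repositories are sufficient for the drone and job containers.
CVMFS_REPOS = ["cms.cern.ch", "unpacked.cern.ch"]

# Hypothetical location of the unpacked htcondor-wn image; the real path under
# /cvmfs/unpacked.cern.ch differs.
DRONE_IMAGE = "/cvmfs/unpacked.cern.ch/registry.hub.docker.com/EXAMPLE/htcondor-wn:latest"

def run_drone():
    """Run for the lifetime of the SGE job, without elevated privileges."""
    cmd = [
        "./cvmfsexec/cvmfsexec", *CVMFS_REPOS, "--",   # user-space CVMFS mounts
        "apptainer", "exec", "--cleanenv", "--bind", "/cvmfs",
        DRONE_IMAGE,
        "condor_master", "-f",   # foreground; spawns the startd that joins the pool
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_drone()
```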
The image below shows both the preferred and the current setup:
Workflows for first large-scale tests
Several workflows, listed below, have been studied in preparation for the first large-scale tests:
- simple file transfers from/to grid storage elements via the gfal2 libraries with X.509 authentication were tested; they work without problems and were used to benchmark file transfers to various grid sites (a minimal example is sketched after this list).
- typical EDM file processing with the CMS software framework CMSSW has been checked; precompiled user code can run inside drones using CMS-specific containers, and a balance between I/O-intensive (e.g. calibration) and CPU-intensive (e.g. analysis) tasks has been found.
- fully orchestrated workflows using modern columnar analysis tools were tried; Run-3 CMS analyses based on NanoAOD, in development at UHH, can be used for first studies, with workflow management tools (e.g. luigi/law) leveraging HTCondor for job submission.
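As a minimal illustration of the gfal2-based transfers mentioned above, the sketch below copies a single file to a grid storage element using the gfal2 Python bindings; the endpoint URL and paths are placeholders, and the X.509 proxy is assumed to be available via X509_USER_PROXY or the default proxy location.

```python
#!/usr/bin/env python3
"""Minimal gfal2-based file transfer sketch with X.509 authentication.
Endpoint URL and paths are placeholders."""
import gfal2

def copy_file(source, destination):
    ctx = gfal2.creat_context()
    params = ctx.transfer_parameters()
    params.overwrite = True   # replace an existing destination file
    params.timeout = 600      # transfer timeout in seconds
    ctx.filecopy(params, source, destination)

if __name__ == "__main__":
    # Copy a local file to a (placeholder) WebDAV door of a grid storage element.
    copy_file("file:///tmp/test.root",
              "davs://se.example.org:2880/store/user/test.root")
```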
Understanding resource needs for columnar analysis
The ttbar reconstruction use case is challenging due to the large jet combinatorics [O(3^Njet) possible assignments per event]. Hooks integrated into the workflow are needed to profile memory allocations (one possible hook is sketched below).
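One possible form of such a hook is sketched below: a decorator based on the standard-library tracemalloc module that records the peak memory allocated while a processing step runs. The function name and integration point are illustrative assumptions; the actual workflow may use different profiling tools.

```python
"""Sketch of a memory-profiling hook for a columnar processing step, based on
the standard-library tracemalloc module; function names are illustrative."""
import functools
import tracemalloc

def profile_allocations(func):
    """Report current and peak allocations made while the wrapped step runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            return func(*args, **kwargs)
        finally:
            current, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            print(f"{func.__name__}: current={current / 1e6:.1f} MB, "
                  f"peak={peak / 1e6:.1f} MB")
    return wrapper

@profile_allocations
def reconstruct_ttbar(events):
    # Placeholder for the combinatorics-heavy jet-assignment step.
    ...
```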
Future work
Planned developments include:
- dedicated host(s) for essential services, such as the HTCondor scheduler, scratch space/cache, and monitoring
- a site-wide CVMFS installation.
Further development within the FIDIUM project can include:
- drone management with COBalD/TARDIS
- a prototype of a federated dCache instance
- integration into the overlay batch system at the NAF.
Note that work on the FIDIUM extension includes collaboration with other sites.
Future developments are summarized in the diagram below:
Automation with COBalD/TARDIS
The goal of the automation work with COBalD/TARDIS is to provide on-demand provisioning of resources based on cluster utilization metrics. A COBalD/TARDIS deployment provides:
- resource integration
- plugins that provide access to external services
- provisioning of resources to batch system users
- control of dynamic resource provisioning (illustrated schematically below).
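To illustrate the idea of demand-driven provisioning, the toy sketch below periodically compares a utilization metric against a threshold and submits additional drones when the cluster is busy. This is not the COBalD/TARDIS API or configuration; all names and thresholds are assumptions made for illustration only.

```python
"""Toy illustration of demand-driven drone provisioning; this is NOT the
COBalD/TARDIS API, which is configured via its own plugins and config files."""
import time

TARGET_UTILISATION = 0.9   # request more drones above this utilization (assumed)
MAX_DRONES = 50            # assumed site limit

def cluster_utilisation() -> float:
    """Placeholder: fraction of busy drone slots reported by the monitoring."""
    return 0.95

def submit_drone() -> None:
    """Placeholder: submit one additional drone to the underlying batch system."""
    print("submitting drone")

def control_loop(active_drones: int = 0) -> None:
    while True:
        if cluster_utilisation() > TARGET_UTILISATION and active_drones < MAX_DRONES:
            submit_drone()
            active_drones += 1
        time.sleep(60)   # re-evaluate the metrics once per minute
```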