PHYSnet cluster integration

Research conducted by the University of Hamburg.

 

In research area III, devoted to adaptation, testing, and optimization, we have set the following goals:

  • deploy tools developed within FIDIUM to selected computing centers
  • integrate them into the production/analysis environments of HEP experiments
  • optimize them for the requirements of typical analysis workflows

PHYSnet cluster overview

The PHYSnet cluster provides computing resources shared by all institutes of the physics faculty. It comprises:

  • multiple heterogeneous pools/queues for diverse applications: idefix.q, infinix.q, obelix.q, epyx.q, graphix.q
  • parts reserved for exclusive use by individual project groups, which makes it possible to tailor resources to individual or group use cases.

To use these resources for HEP, they must be adapted using containerization technologies and integrated transparently into HEP-specific infrastructure.

Current setup vs ideal setup

The ideal setup is a transparent integration of compute resources from third-party sites into a single “overlay batch system”.

The current setup is a small-scale working deployment at PHYSnet for testing. It includes:

  • a small dedicated HTCondor instance, with the schedd running on a general-purpose “compile node” that also acts as the central manager
  • drones submitted to the local SGE batch system as long-running jobs; a startd runs inside each drone and connects to the other HTCondor daemons
  • the CernVM File System (CVMFS) mounted in user space using cvmfsexec
  • all components running without elevated privileges.
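
The drone submission step described above can be sketched as follows. This is a minimal illustration, not the actual PHYSnet configuration: the wrapper script name `start_drone.sh`, the job name, and the 24-hour runtime are hypothetical, and the image sub-path under /cvmfs/unpacked.cern.ch is not given here, so only the repository root is passed. The sketch only builds the `qsub` command line for the local SGE batch system and does not submit anything.

```python
import shlex

# Root of the unpacked-image repository named in the text; the exact
# sub-path of the htcondor-wn image is not specified and must be supplied.
IMAGE_ROOT = "/cvmfs/unpacked.cern.ch"


def build_drone_qsub(queue, image_dir, runtime_hours=24, name="htcondor-drone"):
    """Return the qsub argument list for one long-running drone job."""
    if runtime_hours <= 0:
        raise ValueError("runtime_hours must be positive")
    return [
        "qsub",
        "-N", name,                            # job name (hypothetical)
        "-q", queue,                           # target SGE queue, e.g. idefix.q
        "-l", f"h_rt={runtime_hours}:00:00",   # hard wall-clock limit
        "-b", "y",                             # treat the command as a binary
        "/bin/sh", "start_drone.sh", image_dir,  # hypothetical wrapper script
    ]


cmd = build_drone_qsub("idefix.q", IMAGE_ROOT)
print(shlex.join(cmd))
```

Keeping the command as a list of arguments makes it straightforward to hand to `subprocess.run` later without shell quoting issues.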

In the current setup, containers are handled as follows:

  • containers are run from unpacked images taken from /cvmfs/unpacked.cern.ch. For drones, the htcondor-wn image developed by KIT is used. For job containers, the standard CMS CentOS 7 image cc7-cms is used.
  • the htcondor-wn image allows drones to be reconfigured dynamically: ansible and condor-git-config are used to reconfigure HTCondor without restarting the container.

The image below shows both preferred and current setups:

Workflows for first large-scale tests

Several workflows, listed below, were studied in preparation for the first large-scale test:

  • simple file transfer from/to grid storage elements via the gfal2 libraries with X.509 authentication was tested. It works without problems and was used to benchmark file transfers to various grid sites.
  • typical EDM file processing with the CMS software framework CMSSW was checked. We have determined that precompiled user code can run inside drones using CMS-specific containers. A balance between I/O-intensive (e.g. calibration) and CPU-intensive (e.g. analysis) tasks has been found.
  • fully orchestrated workflows using modern columnar analysis tools were tried. Run-3 CMS analyses based on NanoAOD are in development at UHH and can be used for first studies. Workflow management tools (e.g. luigi/law) use HTCondor for job submission.
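
A transfer benchmark of the kind mentioned in the first item could be structured roughly as below. This is only a sketch under stated assumptions: `copy_fn` is a hypothetical callable that performs one transfer (in practice it might wrap a gfal2 copy call), and the source, destination, and file size are placeholders. The harness itself uses only the Python standard library.

```python
import time


def benchmark_copy(copy_fn, source, destination, size_bytes,
                   repeats=3, clock=time.perf_counter):
    """Run copy_fn(source, destination) several times and return the
    best observed throughput in MB/s (1 MB = 1e6 bytes)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = clock()
        copy_fn(source, destination)
        best = min(best, clock() - start)
    if best <= 0:
        return float("inf")  # transfer too fast to measure
    return size_bytes / best / 1e6


def dummy_copy(source, destination):
    """Placeholder transfer so the sketch runs without grid access."""
    time.sleep(0.01)


rate = benchmark_copy(dummy_copy, "src-placeholder", "dst-placeholder",
                      size_bytes=10_000_000)
print(f"best throughput: {rate:.1f} MB/s")
```

Taking the best of several repeats reduces the influence of transient load on the storage element; reporting the median instead would be an equally reasonable choice.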

Understanding resource needs for columnar analysis

The ttbar reconstruction use case is challenging due to large jet combinatorics [O(3^{N_jet}) possible assignments per event]. Profiling hooks integrated into the workflow are needed to track memory allocations.
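
To illustrate why the combinatorics drives memory use, the sketch below assumes that the quoted scaling corresponds to three options per jet (assigned to the top quark, to the antitop quark, or left unassigned), giving 3 to the power of the number of jets. That labelling and the helper names are assumptions for illustration only. The profiling hook uses `tracemalloc` from the Python standard library; the actual analysis may use different tooling.

```python
import itertools
import tracemalloc

# Assumed three-way labelling per jet; not taken from the analysis code.
LABELS = ("top", "antitop", "unassigned")


def count_assignments(n_jets):
    """Number of possible jet assignments for one event."""
    return len(LABELS) ** n_jets


def iter_assignments(n_jets):
    """Lazily iterate over all assignments without storing them."""
    return itertools.product(LABELS, repeat=n_jets)


def materialise(n_jets):
    """Store every assignment in memory (the expensive variant)."""
    return list(iter_assignments(n_jets))


def peak_memory_bytes(func, *args):
    """Profiling hook: run func and report its peak traced allocation."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


for n in (4, 6, 8):
    combos, peak = peak_memory_bytes(materialise, n)
    print(n, len(combos), peak)
```

Iterating lazily with `iter_assignments` keeps memory flat, whereas materialising all combinations grows with the number of assignments; a hook like `peak_memory_bytes` makes that difference visible per workflow step.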

Future work

Planned developments include:

  • dedicated host(s) for essential services, such as HTCondor scheduler, specific scratch space/cache, monitoring
  • site-wide CVMFS installation.

Further development of the FIDIUM project can include:

  • drone management with COBalD/TARDIS
  • a prototype of a federated dCache instance
  • integration into the overlay batch system at NAF.

Note that work on the FIDIUM extension involves collaboration with other sites.

Future developments can be represented as a diagram below:

Automation with COBalD/TARDIS

The goal of automation with COBalD/TARDIS is on-demand provisioning of resources based on cluster usage metrics. A COBalD/TARDIS deployment provides:

  • resource integration
  • plugins that provide access to external services
  • provisioning of resources to batch system users
  • control of dynamic resource provisioning.
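
The idea of metric-driven, on-demand provisioning can be sketched as a simple feedback rule. The code below is a self-contained illustration and does not use the COBalD or TARDIS APIs; the `PoolState` class, the thresholds, the step size, and the drone limit are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    demand: int          # number of drones currently requested
    utilisation: float   # fraction of provisioned resources doing useful work
    allocation: float    # fraction of provisioned resources claimed by jobs


def next_demand(state, low_utilisation=0.5, high_allocation=0.9,
                step=1, max_drones=100):
    """Return the number of drones to request in the next cycle."""
    if state.utilisation < low_utilisation:
        # Resources are mostly idle: release some.
        new = state.demand - step
    elif state.allocation > high_allocation:
        # Resources are nearly fully claimed: request more.
        new = state.demand + step
    else:
        new = state.demand
    # Keep the request within the allowed range.
    return max(0, min(max_drones, new))


state = PoolState(demand=10, utilisation=0.95, allocation=0.95)
print(next_demand(state))
```

Checking utilisation before allocation means that poorly used resources are released even when they are nominally claimed, which errs on the side of not occupying a shared cluster unnecessarily.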
