Performance of ATLAS jobs on the integrated NHR Emmy HPC system
Research conducted by the University of Goettingen.
In TA 3, we worked on estimating the performance of ATLAS jobs on the integrated NHR Emmy HPC system. Below we provide more detailed information on the work performed and the results obtained.
Benchmarking
For benchmarking, we used the HEPScore23 software, which encompasses workloads from key HEP applications like ATLAS, CMS, LHCb, ALICE, and BELLE2.
On the Emmy system, with hyper-threading enabled, we achieved approximately 1,900 HS23 across 192 cores, corresponding to a per-thread performance of 9.89 HS23. The extrapolated results carry an uncertainty of about 2%.
On GoeGrid, also with hyper-threading enabled, the performance reached about 2,380 HS23 across 256 cores, i.e., a per-thread performance of 9.29 HS23. The per-thread performance of the two systems is thus comparable under the HEPScore23 benchmark.
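To make the arithmetic behind the per-thread figures explicit, here is a minimal sketch in Python; the scores and thread counts are those quoted above, and the helper function itself is purely illustrative:

```python
# Minimal sketch: average HS23 per thread from a total benchmark score
# and the number of threads it was measured across (counts as quoted above).

def per_thread_hs23(total_hs23: float, n_threads: int) -> float:
    """Average HS23 score contributed by a single thread."""
    return total_hs23 / n_threads

emmy = per_thread_hs23(1900, 192)     # ~9.89 HS23 per thread
goegrid = per_thread_hs23(2380, 256)  # ~9.29 HS23 per thread

print(f"Emmy:    {emmy:.3f} HS23/thread")
print(f"GoeGrid: {goegrid:.3f} HS23/thread")
```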
Test results
The test results comprise data for ATLAS jobs from the PanDA queue collected over a 30-day period (August 19 – September 18, 2023). The jobs were mostly of the simulation type, including simulation, pileup, evgen, and deriv jobs.
The plots offer a comparative analysis of completed jobs on the NHR cluster (Emmy) and the GoeGrid Tier-2 cluster, normalized to 100%. The jobs were executed within drones with a 12-hour lifespan, meaning the drones are not continuous.
In the first large-scale test, approximately 76,000 core-hours were used across around 6,300 cores over the 12-hour period (6,300 cores × 12 h ≈ 75,600 core-hours). Additional plots, weighted according to CPU consumption, are provided to give more detailed insight into resource usage (see the sketch below).
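As an illustration of what "weighted according to CPU consumption" means here, the following minimal sketch aggregates hypothetical job records by CPU time instead of simply counting jobs, so long-running jobs dominate the comparison:

```python
# Minimal sketch of CPU-consumption weighting: each job contributes its
# consumed CPU time rather than a simple count. The job records below
# are hypothetical; only the weighting scheme is illustrated.
from collections import defaultdict

jobs = [
    {"type": "simulation", "cpu_time_h": 96.0},
    {"type": "pileup",     "cpu_time_h": 40.0},
    {"type": "evgen",      "cpu_time_h":  4.0},
]

weighted = defaultdict(float)
for job in jobs:
    weighted[job["type"]] += job["cpu_time_h"]

total = sum(weighted.values())
for jtype, cpu in weighted.items():
    print(f"{jtype:10s} {cpu / total:6.1%} of total CPU consumption")
```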
Detailed data on test results:
- Emmy job status. ATLAS jobs ran on Emmy over the 30-day period (19.08–18.09.2023) with a maximum wall time of 12 hours; the wall time still has to be adjusted.
- CPU efficiency:
- Simulation ATLAS jobs. CPU efficiency was compared for these jobs between Emmy and GoeGrid, and was also estimated for NHR Emmy as a function of walltime × cores (see the sketch after this list).
- Pile-up ATLAS jobs. CPU efficiency was likewise compared between Emmy and GoeGrid, and estimated for NHR Emmy as a function of walltime × cores. The amount of resources spent on jobs running at only about half CPU efficiency turned out to be very low.
The output produced by jobs with ATLAS Full Simulation and ATLAS Fast Simulation has been analyzed. It has been determined so far that ATLAS Full Simulation is CPU-intensive, while ATLAS Fast Simulation requires much less CPU power; this investigation is still in progress.
- Mixed ATLAS jobs. A wall-time comparison between Emmy and GoeGrid has been performed. On Emmy, the maximum wall time was restricted to 12 hours to match the drone lifetime.
- Input/Output intensity. I/O intensity was compared for simulation jobs between Emmy and GoeGrid.
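For reference, the CPU-efficiency figure of merit discussed above is understood as the ratio of consumed CPU time to allocated walltime × cores. A minimal sketch, with hypothetical numbers for a single job:

```python
# Minimal sketch of the CPU-efficiency definition used above:
# efficiency = total CPU time / (walltime * allocated cores).
# The example job below is hypothetical.

def cpu_efficiency(cpu_time_s: float, walltime_s: float, n_cores: int) -> float:
    """Fraction of the allocated core time actually spent on the CPU."""
    return cpu_time_s / (walltime_s * n_cores)

# An 8-core simulation job with 10 h of walltime that accumulated
# 72 h of CPU time across its threads.
eff = cpu_efficiency(cpu_time_s=72 * 3600, walltime_s=10 * 3600, n_cores=8)
print(f"CPU efficiency: {eff:.0%}")  # 90%
```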
Having completed the tests, we can say that the results for Emmy are comparable to those for the GoeGrid Tier-2.
Future work
In future work on FIDIUM we plan to:
- Test. Perform tests with ATLAS jobs running regularly, as well as tests of data access via local dCache vs. remote mass data storage vs. remote data lakes (see the sketch below). For more accurate results, the same jobs should be tested on both clusters. We also plan to conduct frequent large-scale tests to gather more data for comparisons; these tests will help us assess how file-system performance is affected by the number of running jobs.
- Extend job runtime. Job runtime will be increased from 12 to 48 hours.
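As a first idea of how the planned data-access comparison could look, the sketch below times sequential reads of the same file from different storage back ends; all paths are hypothetical placeholders for the local dCache, remote mass storage, and data-lake mounts:

```python
# Minimal sketch of a data-access timing test: read the same file from
# several storage back ends and report the observed throughput.
# All paths below are hypothetical placeholders.
import time

def read_throughput_mb_s(path: str, chunk_bytes: int = 64 * 1024 * 1024) -> float:
    """Sequentially read a file and return the observed throughput in MB/s."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            total += len(chunk)
    return total / (time.perf_counter() - start) / 1e6

backends = {
    "local dCache":        "/pnfs/local-dcache/sample.root",    # hypothetical
    "remote mass storage": "/remote/mass-storage/sample.root",  # hypothetical
    "remote data lake":    "/datalake/mount/sample.root",       # hypothetical
}

for label, path in backends.items():
    print(f"{label:20s} {read_throughput_mb_s(path):8.1f} MB/s")
```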