Testing future infrastructure designs for efficiency in simulation
Research conducted by the Karlsruhe Institute of Technology (KIT).
In our work on Research Area II, focused on data lakes, distributed data, and caching, we identified and deployed a flexible, high-performance, and accurate simulation tool. This effort also laid the groundwork for practical research into innovative and realistic computing infrastructures.
When planning the future computing infrastructure for High Energy Physics (HEP), we considered a range of complex requirements, including:
- Analysis, reconstruction, and simulation on the Worldwide LHC Computing Grid (WLCG).
- A dynamic, distributed infrastructure involving many components, variable workloads, data access, resource availability, and complex scheduling systems.
- Limited resources for infrastructure modernization.
After thoroughly analyzing these needs and constraints, we implemented a new simulation infrastructure. Previously, the MONARC simulation framework was used for such studies, but it has been discontinued. We therefore chose to implement a modern simulation framework, DCSim, based on the SimGrid and WRENCH open-source simulation toolkits.
- SimGrid offers low-level simulation abstractions for distributed systems, modeling network, storage, and CPU resources with a fluid (flow-level) approach; a toy illustration follows this list.
- WRENCH, built on SimGrid, provides high-level tools and services for defining and managing activities within the simulation.
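To make the fluid approach concrete, here is a minimal, self-contained Python sketch of flow-level bandwidth sharing. It is our own toy illustration, not SimGrid code, and all names are hypothetical: concurrent transfers share a link's capacity equally, and the simulation jumps from one flow completion to the next rather than simulating packets.

```python
# Toy flow-level ("fluid") network model: concurrent transfers share a
# link's bandwidth equally; no packets are simulated.
# Hypothetical sketch, not the actual SimGrid implementation.

def fluid_transfer_times(sizes_bytes, bandwidth_bytes_per_s):
    """Return the completion time (s) of each transfer on one shared link."""
    remaining = dict(enumerate(sizes_bytes))
    done = {}
    now = 0.0
    while remaining:
        rate = bandwidth_bytes_per_s / len(remaining)   # equal (fair) share
        # The smallest remaining transfer finishes next.
        fid, size = min(remaining.items(), key=lambda kv: kv[1])
        dt = size / rate
        now += dt
        for k in remaining:                             # drain all flows
            remaining[k] = max(0.0, remaining[k] - rate * dt)
        done[fid] = now
        del remaining[fid]
    return done

# Three transfers (1 GB, 2 GB, 2 GB) over a 1 Gb/s (125 MB/s) link:
print(fluid_transfer_times([1e9, 2e9, 2e9], bandwidth_bytes_per_s=125e6))
```

This flow-level abstraction is what keeps simulations of large distributed systems tractable compared to packet-level network simulators.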
We have also implemented HEP-specific adaptations, such as job, dataset, and workflow models, data streaming logic, and service management capabilities. These extensions are available as open-source code.
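As a rough illustration of what such HEP-specific models can look like, the Python sketch below defines simplified job and dataset records. The field names are our own choices for illustration, not DCSim's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    """A collection of input files, e.g. one block of event data."""
    name: str
    file_sizes_bytes: list[float]      # one entry per file

@dataclass
class Job:
    """A simplified HEP job description for simulation purposes."""
    name: str
    flops: float                       # total compute work
    memory_bytes: float                # RAM requirement
    input_dataset: Dataset
    output_size_bytes: float
    cores: int = 1

# Hypothetical reconstruction job reading ten 2 GB files:
reco_input = Dataset("run2023_block01", file_sizes_bytes=[2e9] * 10)
job = Job("reco_0001", flops=5e13, memory_bytes=4e9,
          input_dataset=reco_input, output_size_bytes=1e9)
```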
Building on these components, we implemented the DCSim simulator, which can simulate the execution of computing workflows on distributed infrastructures. DCSim gives users the flexibility to:
- Define job workloads, including operations, memory requirements, input datasets, and output sizes.
- Specify the platform parameters, such as CPU speed, RAM, disk space, and network routes.
- Deploy storage systems and allocate files.
- Run the simulation: jobs are scheduled, inputs are streamed and cached, and cache management is handled dynamically.
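Putting the four steps together, a study could be organised around inputs like the following Python sketch; the schema is hypothetical, and the actual DCSim input format may differ.

```python
# Hypothetical end-to-end configuration for one simulation run; the field
# names are illustrative, not DCSim's actual input schema.
simulation_config = {
    "platform": {
        "hosts": [
            {"name": "worker1", "cores": 16, "speed_flops": 1e10,
             "ram_bytes": 64e9, "disk_bytes": 2e12},
        ],
        "links": [
            {"name": "wan", "bandwidth_bps": 10e9, "latency_s": 0.02},
        ],
    },
    "storage": {
        "remote_grid_storage": ["run2023_block01"],  # initial file placement
        "local_cache": [],                           # filled during the run
    },
    "workload": [
        {"job": "reco_0001", "flops": 5e13, "memory_bytes": 4e9,
         "inputs": ["run2023_block01"], "output_bytes": 1e9},
    ],
}
```

During the run, the simulator schedules the jobs onto the hosts, streams inputs from remote storage, fills the local cache, and evicts files once the cache is full.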
Free parameters of the simulation models have been calibrated to align with the behaviour of real-world HEP computing systems. Predictions made by the calibrated simulator have been validated against independent real-world measurements.
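Conceptually, calibration tunes the free model parameters so that simulated observables match measured ones. The sketch below illustrates the idea with SciPy; the objective, parameters, and numbers are hypothetical stand-ins, and DCSim's actual calibration procedure may differ.

```python
import numpy as np
from scipy.optimize import minimize

measured_runtimes = np.array([120.0, 240.0, 95.0])   # seconds, from real jobs

def simulate_runtimes(params):
    """Stand-in for a full simulator run; returns predicted job runtimes.
    In practice this would invoke the simulator with the given parameters."""
    cpu_eff, io_overhead = params
    base = np.array([100.0, 200.0, 80.0])             # toy baseline runtimes
    return base / cpu_eff + io_overhead

def loss(params):
    """Squared discrepancy between simulated and measured runtimes."""
    return np.sum((simulate_runtimes(params) - measured_runtimes) ** 2)

result = minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead")
print("calibrated parameters:", result.x)
```

Validation then checks the calibrated simulator against independent measurements that were not used in the fit.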
This work represents a significant step forward in developing future-ready computing infrastructures for HEP in a reliable and systematic way, combining flexibility, efficiency, and accuracy in a simulation environment.
Future work
Looking ahead, we plan to further enhance our simulation capabilities. We will focus on improving simulation speed, for instance by utilising machine-learning surrogate models, on automating the calibration and validation processes, and on adding uncertainty estimates for parameters, calibration, and models.
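To sketch the surrogate idea: a regression model trained on past simulation runs can answer what-if questions in milliseconds instead of requiring a full simulation. The scikit-learn example below uses hypothetical features and toy labels of our own making.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Hypothetical training set: each row is (n_jobs, cache_size_TB,
# bandwidth_Gbps) from past simulation runs, labelled with the
# simulated makespan; the label formula is a toy stand-in.
X = rng.uniform([100, 1, 1], [5000, 100, 100], size=(500, 3))
y = X[:, 0] / (X[:, 1] * 0.2 + X[:, 2])

surrogate = RandomForestRegressor(n_estimators=200).fit(X, y)

# Cheap prediction for a new scenario, instead of a full simulation run:
print(surrogate.predict([[2000, 50, 40]]))
```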
Additionally, if needed, we will develop simulations that track energy consumption, helping us better understand and optimize resource usage.
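One common starting point for such energy tracking is a linear host power model, where power grows linearly with CPU utilisation between an idle and a full-load wattage (the approach taken, for instance, by SimGrid's host energy plugin). A minimal sketch with illustrative wattages:

```python
def host_energy_joules(samples, p_idle_w=100.0, p_max_w=250.0):
    """Integrate energy over (duration_s, cpu_utilisation) samples using a
    linear power model: P(u) = P_idle + (P_max - P_idle) * u."""
    return sum(dt * (p_idle_w + (p_max_w - p_idle_w) * u)
               for dt, u in samples)

# One hour at 80% load followed by one hour idle (illustrative numbers):
print(host_energy_joules([(3600, 0.8), (3600, 0.0)]) / 3.6e6, "kWh")
```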
A prime application of the simulator is the study of data-aware scheduling systems that intelligently account for cached-data locality in future grid computing infrastructures.
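To make this concrete, the sketch below ranks candidate sites by the fraction of a job's input data already cached there, using free cores as a tie-breaker. The names and structures are hypothetical, not DCSim's scheduler interface.

```python
def cached_fraction(job_inputs, site_cache):
    """Fraction of the job's input bytes already resident in a site's cache."""
    total = sum(job_inputs.values())
    hit = sum(size for f, size in job_inputs.items() if f in site_cache)
    return hit / total if total else 0.0

def pick_site(job_inputs, sites):
    """Prefer the site with the most cached input; break ties by free cores."""
    return max(sites,
               key=lambda s: (cached_fraction(job_inputs, s["cache"]),
                              s["free_cores"]))

# Hypothetical example: siteB holds the whole input, siteA only half.
sites = [
    {"name": "siteA", "cache": {"fileA"}, "free_cores": 4},
    {"name": "siteB", "cache": {"fileA", "fileB"}, "free_cores": 1},
]
job_inputs = {"fileA": 2e9, "fileB": 2e9}
print(pick_site(job_inputs, sites)["name"])   # -> siteB (all input cached)
```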