SIMD and Vector Classes in high-energy physics
Research conducted by Goethe University Frankfurt.
SIMD (Single Instruction, Multiple Data) is a parallel computing technique in which a single instruction is applied to multiple data elements simultaneously. The elements are grouped into a vector, so one vector instruction does the work of several scalar ones, which significantly speeds up the processing of large data sets.
Vc (Vector Classes) is a portable, zero-overhead C++ library designed specifically for SIMD programming. It provides a seamless way to work with SIMD data types, making data-level parallel programming more accessible and efficient.
Key features of Vc include:
- Overloaded operators
- Unified code
- Easy masking
- Simple vector algorithms
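To make these features concrete, here is a minimal C++ sketch of a fixed-width vector type with overloaded operators, a comparison that yields a mask, a masked selection, and a horizontal sum. The names `Vec4f`, `Mask4`, `select`, and `sum` are hypothetical and chosen for illustration; this is not the Vc API, and the plain loops stand in for what a real SIMD library would typically map to single hardware instructions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 4-lane types for illustration only; this is NOT the Vc API.
constexpr std::size_t kLanes = 4;

struct Mask4 {
    std::array<bool, kLanes> lane;
};

struct Vec4f {
    std::array<float, kLanes> lane;

    // Copy one scalar into every lane.
    static Vec4f broadcast(float x) {
        Vec4f v{};
        v.lane.fill(x);
        return v;
    }

    // Load kLanes consecutive floats starting at p.
    static Vec4f load(const float* p) {
        assert(p != nullptr);
        Vec4f v{};
        for (std::size_t i = 0; i < kLanes; ++i) v.lane[i] = p[i];
        return v;
    }
};

// Overloaded operator: one expression acts on all lanes.
inline Vec4f operator+(const Vec4f& a, const Vec4f& b) {
    Vec4f r{};
    for (std::size_t i = 0; i < kLanes; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Comparison produces a per-lane mask instead of a single bool.
inline Mask4 operator<(const Vec4f& a, const Vec4f& b) {
    Mask4 m{};
    for (std::size_t i = 0; i < kLanes; ++i) m.lane[i] = a.lane[i] < b.lane[i];
    return m;
}

// Masking: pick if_true where the mask is set, if_false elsewhere.
inline Vec4f select(const Mask4& m, const Vec4f& if_true, const Vec4f& if_false) {
    Vec4f r{};
    for (std::size_t i = 0; i < kLanes; ++i)
        r.lane[i] = m.lane[i] ? if_true.lane[i] : if_false.lane[i];
    return r;
}

// Simple vector algorithm: horizontal sum over all lanes.
inline float sum(const Vec4f& v) {
    float s = 0.0f;
    for (std::size_t i = 0; i < kLanes; ++i) s += v.lane[i];
    return s;
}

// Branch-free example: replace negative lanes by zero, then sum.
inline float sum_of_non_negative(const Vec4f& v) {
    const Vec4f zero = Vec4f::broadcast(0.0f);
    return sum(select(v < zero, zero, v));
}
```

The point of the masked selection is that the per-lane condition is handled without an `if` inside the computation, so all lanes follow the same instruction stream.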
The data structures used in SIMD programming include:
- Array of structures (AoS): data are grouped by object, which is well suited to sorting, but SIMD code has to access memory randomly.
- Structure of arrays (SoA) / simple arrays: data are grouped by variable, which vectorizes well and gives SIMD code sequential memory access, though the code can become complex.
- Array of structures of arrays (AoSoA, SIMD vectors): combines both approaches, offering good vectorization with sequential memory access and the potential for template-based enhancements.
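The three layouts can be sketched in C++ as follows. The `Hit` type with `x`, `y`, `z` coordinates, the width `kWidth = 4`, and the helper `to_aosoa` are assumptions made for this illustration; they are not taken from the CBM software.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kWidth = 4;  // assumed SIMD width in floats

// Array of structures (AoS): one object per hit, easy to sort or move around.
struct Hit {
    float x, y, z;
};
using HitsAoS = std::vector<Hit>;

// Structure of arrays (SoA): one contiguous array per variable.
struct HitsSoA {
    std::vector<float> x, y, z;
};

// Array of structures of arrays (AoSoA): each element holds kWidth hits,
// stored per variable, so one pack fills one SIMD vector per coordinate.
struct HitPack {
    std::array<float, kWidth> x{}, y{}, z{};
};
using HitsAoSoA = std::vector<HitPack>;

// Repack AoS hits into AoSoA; unused lanes of the last pack stay zero.
inline HitsAoSoA to_aosoa(const HitsAoS& hits) {
    HitsAoSoA packs((hits.size() + kWidth - 1) / kWidth);
    assert(packs.size() * kWidth >= hits.size());
    for (std::size_t i = 0; i < hits.size(); ++i) {
        HitPack& p = packs[i / kWidth];
        const std::size_t lane = i % kWidth;
        p.x[lane] = hits[i].x;
        p.y[lane] = hits[i].y;
        p.z[lane] = hits[i].z;
    }
    return packs;
}
```

In the AoSoA form, the `x` values of four neighbouring hits sit next to each other in memory, so they can be loaded into a vector sequentially, while each pack can still be handled as a single object.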
CBM L1 Track Finder is the main algorithm for time-based particle trajectory reconstruction in the STS and MVD detectors of the CBM experiment. It has been upgraded to work both in vector mode using Vc, and in scalar mode, which is suitable for parallelization on GPUs.
CBM L1 Track Finder:
- Offers fast and efficient 4D reconstruction.
- Is based on a cellular automaton.
- Is highly parallelizable and vectorizable.
Template-based code and SIMD efficiency
Template-based code can be used to:
- Minimize dependency on specific implementations of SIMD header files or libraries.
- Simplify software maintenance, since classes and data structures do not have to be duplicated for scalar and vector code.
- Provide the ability to compile code for different platforms at the same time with different settings.
For example, Vc SIMDized functions and data structures can be used when fitting tracks for KF Particle Finder on the CPU, while a non-SIMD version of the same code is used for tracking on the GPU.
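The idea of writing the code once and instantiating it for scalar or vector types can be sketched as below. `Lanes<N>` is a hypothetical stand-in for a SIMD vector type (not Vc's API), and `extrapolate` is an illustrative straight-line propagation step, not code from the L1 Track Finder or KF Particle Finder.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a SIMD vector type; NOT Vc's API.
template <std::size_t N>
struct Lanes {
    std::array<float, N> v;
};

template <std::size_t N>
Lanes<N> operator+(const Lanes<N>& a, const Lanes<N>& b) {
    Lanes<N> r{};
    for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

template <std::size_t N>
Lanes<N> operator*(const Lanes<N>& a, const Lanes<N>& b) {
    Lanes<N> r{};
    for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Written once: T may be float (scalar build) or Lanes<N> (vector build).
// Straight-line extrapolation of a coordinate x with slope tx over dz.
template <typename T>
T extrapolate(const T& x, const T& tx, const T& dz) {
    return x + tx * dz;
}

// Usage check: each lane of the vector result equals the scalar result.
inline void check_scalar_vector_agreement() {
    Lanes<4> x{}, tx{}, dz{};
    x.v = {0.0f, 1.0f, 2.0f, 3.0f};
    tx.v = {0.5f, 0.5f, 0.5f, 0.5f};
    dz.v = {2.0f, 2.0f, 2.0f, 2.0f};
    const Lanes<4> r = extrapolate(x, tx, dz);
    for (std::size_t i = 0; i < 4; ++i)
        assert(r.v[i] == extrapolate(x.v[i], tx.v[i], dz.v[i]));
}
```

Because the algorithm depends only on the operators of `T`, switching between a scalar and a vector build is a matter of changing the type argument rather than maintaining two copies of the code.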
Runtime measurements and performance
Runtime measurements showed that:
- Vectorization efficiency depends on the amount of non-SIMD operations.
- The ideal speed-up cannot be reached because of overhead, but a good speed-up is achieved.
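Why the share of non-SIMD operations matters can be illustrated with a simple Amdahl-style estimate. This is a textbook model used here as an assumption, not the measurement procedure behind the results above; `f` (the fraction of scalar runtime that can be vectorized) and `width` (the number of SIMD lanes) are illustrative parameters.

```cpp
#include <cassert>

// Amdahl-style estimate of the speed-up from vectorization.
// f:     fraction of the scalar runtime that is vectorizable, in [0, 1]
// width: number of SIMD lanes (1 means no vectorization)
inline double simd_speedup(double f, int width) {
    assert(f >= 0.0 && f <= 1.0 && width >= 1);
    return 1.0 / ((1.0 - f) + f / width);
}

// Vectorization efficiency: achieved speed-up relative to the ideal one.
inline double simd_efficiency(double f, int width) {
    return simd_speedup(f, width) / width;
}
```

In this model, with 8 lanes and 90% of the runtime vectorizable, the estimated speed-up is about 4.7 rather than 8, and any additional overhead lowers it further.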
Primary assessments and comparisons of the computing efficiency of CPUs and GPUs were carried out within the AP-I. The results obtained will be useful in further studies of AP-III.
Future work
CBM L1 Track Finder
Current and future work on CBM L1 Track Finder includes:
- GPU parallelization and preparation of a common CPU/GPU version of the tracker based on the XPU framework;
- Containerization and preparation of mechanisms for efficient portable running and utilization of the event reconstruction chain in the CBM experiment.
FIDIUM development in the future
Plans for future work on FIDIUM:
- Estimate the energy consumption of efficient, vectorized tracking code running on many-core CPUs and GPUs;
- Systematic study of dependencies:
  - Input data size;
  - Number of processing cores/threads;
  - Vectorization (SSE and AVX intrinsics);
  - Different algorithms for reconstruction;
  - Hardware parameters (utilization of memory, memory bandwidth, CPU frequency).
- Study the effect of minimizing tracking runtimes on the actual energy consumption of the resources (CPUs/GPUs).