GPU support for the training of DeepTau on TOpAS
Research conducted by the Karlsruhe Institute of Technology (KIT).
DeepTau overview
Tau leptons are of great interest in particle physics, particularly because their hadronic decays involve many particles and are therefore quite complex. Quark jets, gluon jets, muons, and electrons can often be misidentified as hadronic taus. To address this issue, neural networks (NNs) are employed to reduce misidentifications. Among these networks, DeepTau is notable for utilizing low-level inputs alongside the usual high-level variables. Currently, DeepTau V3 is in development and has been a primary focus of this collaboration.
The goal of this collaboration was to run the DeepTau NN training on TOpAS (Throughput Optimized Analysis System), which provides a significant amount of GPU processing power.
Previously, the GPUs available for DeepTau training were limited to a few local NVIDIA P100 cards. The newer hardware of TOpAS will help accelerate the training process, and the larger number of GPUs on TOpAS will enable multiple concurrent training sessions.
Hyperparameter search and scaling up
Up to millions of weight and bias parameters are continuously adjusted during NN training. Hyperparameters are a different kind of parameter that governs how the training is performed:
- Layout of the NN architecture
- Dropout ratio
- Learning rate
- Train-test split ratio
Several algorithms exist to optimize these hyperparameters:
- Grid search
- Random search
- Bayesian optimization
- Evolutionary optimization
All of these approaches require many iterations of very similar trainings. Since many of these trainings can run concurrently during the optimization, TOpAS makes it possible to perform such a search within a reasonable timeframe.
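To illustrate the kind of workload this implies, the following is a minimal random-search sketch. The tiny fully connected model, the hyperparameter ranges, and the helper names are hypothetical placeholders, not the actual DeepTau architecture or search space.

```python
# Minimal random hyperparameter search sketch (illustrative only).
# Model, ranges, and function names are hypothetical stand-ins for DeepTau.
import random
import tensorflow as tf

def build_model(n_layers, n_units, dropout, learning_rate):
    """Build a small fully connected classifier from one hyperparameter set."""
    model = tf.keras.Sequential()
    for _ in range(n_layers):
        model.add(tf.keras.layers.Dense(n_units, activation="relu"))
        model.add(tf.keras.layers.Dropout(dropout))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC()],
    )
    return model

def random_search(x_train, y_train, x_val, y_val, n_trials=20):
    """Draw random hyperparameter sets and keep the best validation AUC."""
    best_auc, best_params = 0.0, None
    for _ in range(n_trials):
        params = {
            "n_layers": random.choice([2, 3, 4]),
            "n_units": random.choice([64, 128, 256]),
            "dropout": random.uniform(0.1, 0.5),
            "learning_rate": 10 ** random.uniform(-4, -2),
        }
        model = build_model(**params)
        model.fit(x_train, y_train, epochs=5, batch_size=256, verbose=0)
        _, auc = model.evaluate(x_val, y_val, verbose=0)
        if auc > best_auc:
            best_auc, best_params = auc, params
    return best_params, best_auc
```

Each trial is independent of the others, which is why a system like TOpAS helps: many such trials can be dispatched to different GPUs at the same time.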
Limiting factors
Limiting factors for the training are:
- Input pipeline
- Memory leaks
The training process was hindered by the speed of data input: the GPU training outpaced the preprocessing of the input data. To resolve this bottleneck, the input pipeline was completely rewritten. Instead of preprocessing files at runtime, preprocessed training files are now stored in advance, ensuring smoother and faster training runs.
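The sketch below shows the general "preprocess once, read many times" pattern with tf.data. The TFRecord layout, feature names, and parallelism numbers are hypothetical; the actual DeepTau training files use their own format.

```python
# Sketch of streaming already-preprocessed files instead of preprocessing at
# runtime. Feature spec and file format are hypothetical placeholders.
import tensorflow as tf

FEATURE_SPEC = {
    "features": tf.io.FixedLenFeature([64], tf.float32),  # placeholder input size
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    """Decode one preprocessed training example."""
    example = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    return example["features"], example["label"]

def make_dataset(file_pattern, batch_size):
    """Build an input pipeline over pre-stored TFRecord files."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    dataset = files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,        # read a bounded number of files in parallel
        num_parallel_calls=4,  # do not spawn one reader thread per CPU core and file
    )
    return (dataset
            .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(10_000)
            .batch(batch_size, drop_remainder=True)
            .prefetch(tf.data.AUTOTUNE))  # overlap input with GPU compute
```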
After replacing the input pipeline, we noticed that the majority of the runtime was consumed by overhead typical of small batch sizes. Unfortunately, increasing the batch size was initially not an option because device memory ran out, a limitation that had not affected the less powerful hardware used before.
We also discovered memory leaks caused by unsuitable software defaults: TensorFlow was attempting to assign every installed CPU core to each input file, resulting in over 30,000 threads. By addressing this, we were able to increase the batch size by a factor of seven.
Although this issue existed previously, it only became apparent with the recent scale increase.
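A minimal sketch of how such thread explosions can be bounded is shown below. The specific limits are illustrative assumptions, not the values used in the DeepTau setup; the point is that without explicit limits, per-file thread pools can multiply into tens of thousands of threads on a many-core host.

```python
# Sketch of bounding TensorFlow's CPU-thread defaults (numbers are illustrative).
import tensorflow as tf

# Global limits on TensorFlow's own thread pools (set before any ops run).
tf.config.threading.set_intra_op_parallelism_threads(8)
tf.config.threading.set_inter_op_parallelism_threads(8)

def bound_input_threads(dataset, n_threads=8):
    """Give the tf.data pipeline its own small, fixed thread pool."""
    options = tf.data.Options()
    options.threading.private_threadpool_size = n_threads
    options.threading.max_intra_op_parallelism = 1
    return dataset.with_options(options)
```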
Profiling
Profiling is crucial both for finding bottlenecks and for identifying software issues. With its help, multiple issues were discovered and subsequently solved.
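As an example of how such traces can be collected, the sketch below uses the standard TensorFlow Profiler; the log directory and the profiled batch range are arbitrary choices for illustration, not the settings used in this project.

```python
# Minimal profiling sketch with the TensorFlow Profiler (paths/ranges illustrative).
import tensorflow as tf

# Option 1: profile a window of training batches through the Keras callback.
tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir="logs/profile",
    profile_batch=(10, 20),  # capture only batches 10-20 to keep overhead low
)
# model.fit(dataset, epochs=1, callbacks=[tensorboard_cb])

# Option 2: profile an arbitrary code region programmatically.
tf.profiler.experimental.start("logs/profile")
# ... run the training steps to be inspected ...
tf.profiler.experimental.stop()
```

The resulting traces can be inspected in TensorBoard's Profile tab, which breaks the step time down into input-pipeline, host, and device components.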
Optimization results
After the optimization work described above, the following results were achieved:
- Training performance was greatly improved
- The full dataset can now be used without memory issues
- Training was sped up by an order of magnitude
However, such an adaptation is difficult for small-scale ML projects.
Future work
The key focus of the 2024 FIDIUM project extension will be:
- GPU cluster efficiency. Running the GPU cluster efficiently is the primary goal, which is vital for both performance and sustainability.
- Dynamic sharing. We aim to develop efficient methods for dynamically sharing GPUs among multiple tasks while ensuring task isolation.
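One possible building block for such sharing is constraining how much GPU memory each TensorFlow process may claim instead of letting it grab the whole card. The sketch below uses standard TensorFlow configuration calls; the memory split is an arbitrary assumption, not a FIDIUM design decision.

```python
# Sketch of capping per-process GPU memory as one ingredient of GPU sharing.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Either let the process grow its allocation on demand ...
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # ... or partition the card into fixed-size logical devices
    # (mutually exclusive with memory growth on the same GPU):
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=8192),
    #      tf.config.LogicalDeviceConfiguration(memory_limit=8192)],
    # )
```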