Convolutional neural networks are widely used in many fields, and the convolution layer is their core component. The backpropagation speed of the convolution layer directly affects the training speed of the whole network and therefore its overall performance. For convolution layers with stride ≥ 2, the error-propagation phase of backpropagation requires a large amount of padding of the feature maps, which introduces substantial extra memory-access and computation overhead. To address this, we propose a new optimization method that reduces the padding overhead to nearly zero and implement it with implicit convolution on a domestic heterogeneous platform. Experiments show that the operator optimized with this method performs nearly 50% better than the platform's original operator and reaches, on average, 90% of the performance of the corresponding NVIDIA V100 operator.
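To make the padding overhead concrete, the following minimal 1-D sketch (our own illustration, not code from the paper) contrasts the textbook input-gradient computation for a strided convolution, which materializes a zero-dilated and zero-padded error buffer, with an implicit formulation that scatters the same contributions directly and never builds that buffer.

```python
import numpy as np

def conv1d_forward(x, w, stride):
    """Strided 1-D convolution: y[i] = sum_k w[k] * x[i*stride + k]."""
    K = len(w)
    out_len = (len(x) - K) // stride + 1
    return np.array([np.dot(w, x[i*stride:i*stride + K]) for i in range(out_len)])

def grad_input_explicit(dy, w, in_len, stride):
    """Textbook route: dilate dy with stride-1 zeros, zero-pad it, then run a
    stride-1 convolution with the flipped kernel. For stride >= 2 the dilated
    buffer is mostly zeros, which is exactly the overhead described above."""
    K = len(w)
    dy_dil = np.zeros((len(dy) - 1) * stride + 1)
    dy_dil[::stride] = dy                                   # scatter into a sparse buffer
    dy_pad = np.pad(dy_dil, (K - 1, in_len - len(dy_dil)))  # total length in_len + K - 1
    w_flip = w[::-1]
    return np.array([np.dot(w_flip, dy_pad[j:j + K]) for j in range(in_len)])

def grad_input_implicit(dy, w, in_len, stride):
    """Implicit route: accumulate each dy[i]*w[k] straight into dx[i*stride + k],
    so no dilated or padded buffer is ever materialized."""
    dx = np.zeros(in_len)
    for i in range(len(dy)):
        for k in range(len(w)):
            dx[i * stride + k] += dy[i] * w[k]
    return dx

x, w, s = np.random.randn(11), np.random.randn(3), 2
dy = np.random.randn(len(conv1d_forward(x, w, s)))
assert np.allclose(grad_input_explicit(dy, w, len(x), s),
                   grad_input_implicit(dy, w, len(x), s))
```

Both routes compute the same input gradient; the implicit one simply skips the positions that would have multiplied by padded zeros.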
Convolutional Neural Networks (CNNs) have long been a central topic in deep learning. As the demand for network models in everyday production grows, optimizing the convolution computation process becomes increasingly important. This paper starts from the backpropagation process of convolutional neural networks, presents the derivation of CNN backpropagation and the im2col conversion, uses an implicit approach to convert the convolution computation on the domestic acceleration platform, and optimizes the convolution backpropagation operator with several general matrix multiplication (GEMM) optimization strategies. The final performance reaches more than 70% of that of the corresponding NVIDIA operator, which meets the experiment's expectation given the performance bottleneck of the platform.
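As an illustration of the im2col conversion mentioned above (a generic sketch, not the paper's kernel), the snippet below unfolds input patches into a matrix so that a 2-D convolution becomes a single GEMM; an implicit im2col kernel computes these column indices on the fly inside the matrix multiplication instead of materializing the unfolded matrix.

```python
import numpy as np

def im2col(x, K, stride=1):
    """Unfold each KxK patch of a 2-D input into one column of a matrix."""
    H, W = x.shape
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    cols = np.empty((K * K, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride + K, j*stride:j*stride + K]
            cols[:, i * out_w + j] = patch.ravel()
    return cols

def conv2d_as_gemm(x, w, stride=1):
    """Convolution as GEMM: the filter becomes a 1 x K*K row vector and every
    output pixel is the dot product of that row with one unfolded column."""
    K = w.shape[0]
    out_h = (x.shape[0] - K) // stride + 1
    out_w = (x.shape[1] - K) // stride + 1
    y = w.reshape(1, -1) @ im2col(x, K, stride)
    return y.reshape(out_h, out_w)

x, w = np.random.randn(8, 8), np.random.randn(3, 3)
ref = np.array([[np.sum(x[i:i+3, j:j+3] * w) for j in range(6)] for i in range(6)])
assert np.allclose(conv2d_as_gemm(x, w), ref)
```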
In high-performance computing (HPC), convolution operations account for a large proportion of the work in convolutional neural networks, and convolutional neural networks are very common in image- and video-based deep learning applications; for this reason, this paper takes improving the performance of the convolution operation as its research direction. Convolution can be computed in many ways, such as direct evaluation of the mathematical definition, conversion to the Fast Fourier Transform (FFT), conversion to batched matrix multiplication (im2col), or the Winograd algorithm. For small filters, Winograd has unique advantages. On an AMD-based ROCm environment, this paper introduces an implementation of Winograd convolution and an optimization of Winograd based on a multi-thread communication algorithm. Compared with the Winograd convolution in ROCm 2.9.0, the optimized algorithm in this paper is more than 150% faster. Under certain compute-capability conditions, the performance of the optimized algorithm approaches or even exceeds that of cuDNN and MIOpen.
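For readers unfamiliar with the transform, the sketch below (our own minimal illustration, not the paper's GPU kernel) shows the smallest Winograd case F(2,3): two outputs of a 3-tap filter are produced from a 4-element input tile with 4 multiplications instead of 6, which is the source of the advantage for small filters.

```python
import numpy as np

# Transform matrices for Winograd F(2,3): two outputs of a 3-tap filter
# from a 4-element input tile with 4 element-wise multiplications.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """y = AT @ ((G @ g) * (BT @ d)): filter transform, input transform,
    element-wise product, then output transform."""
    return AT @ ((G @ g) * (BT @ d))

d, g = np.random.randn(4), np.random.randn(3)
direct = np.array([d[0:3] @ g, d[1:4] @ g])   # sliding-window reference
assert np.allclose(winograd_f23(d, g), direct)
```

In practice the filter transform `G @ g` is precomputed once per filter, and the input and output transforms are amortized over tiles of the feature map.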
Large-scale network models based on the Transformer architecture are highly versatile in many fields. Because such models are computation-intensive and large in scale, large-scale training on domestic heterogeneous accelerators is constrained by computing and communication efficiency, resulting in poor training performance. To address this problem, the hot functions and performance bottlenecks in the training process are studied and analyzed, and corresponding performance optimization methods are proposed based on the hardware characteristics of domestic heterogeneous accelerators. To solve the problem of low performance in low-precision training, low-precision packing optimization is applied to the underlying matrix multiplication core operator. To solve the significant kernel-launch latency caused by fine-grained core operators, the LightSeq framework is ported to the domestic heterogeneous platform for the first time, and the core fine-grained operators are specially optimized for the hardware structure according to the characteristics of the network structure to accelerate training. For large-scale training, to address the low bandwidth of cross-node communication, distributed communication is optimized at both the data-transmission and hardware-topology levels, improving communication efficiency by reducing communication frequency and increasing communication bandwidth. Experimental results on the WMT '14 English-German translation dataset show that, after optimization, single-node performance is improved by a factor of two without loss of training accuracy. The computation is then gradually scaled up to 128 nodes (512 accelerator cards) for large-scale distributed training and verification; while maintaining the performance improvement, scalability exceeds 90% on 256 accelerator cards.
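As one concrete (and deliberately simplified) illustration of reducing communication frequency, the sketch below accumulates gradients locally over several micro-batches and issues a single all-reduce per accumulation window; the `allreduce` and `grad_fn` placeholders are assumptions made for a self-contained example, not interfaces from the paper or from LightSeq.

```python
import numpy as np

def allreduce(grads):
    """Placeholder for a cross-node all-reduce (e.g. summing gradients over
    workers); a no-op here so the sketch stays self-contained."""
    return grads

def train_with_accumulation(params, micro_batches, grad_fn, lr=1e-3, accum_steps=4):
    """Accumulate gradients locally for `accum_steps` micro-batches and
    communicate once per window, cutting all-reduce frequency by that factor."""
    accum = [np.zeros_like(p) for p in params]
    for step, batch in enumerate(micro_batches, start=1):
        for a, g in zip(accum, grad_fn(params, batch)):    # local backward pass
            a += g
        if step % accum_steps == 0:
            reduced = allreduce(accum)                     # one all-reduce per window
            for p, r in zip(params, reduced):
                p -= lr * r / accum_steps                  # apply averaged update
            accum = [np.zeros_like(p) for p in params]     # reset local accumulators
    return params
```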