Quantized neural networks are now widely used for model inference on edge devices. In FPGA-based inference, convolution operations are typically implemented with DSP blocks that provide 27×18-bit multiplications. However, there is no general method for fully exploiting a DSP block to compute arbitrary low-precision quantized data in parallel. In this paper, we propose MDP, a universal solution that packs multiple multiplications into one DSP block, selecting among multiple packing modes according to the quantization precision and convolution parameters. To maximize DSP multiplication performance, we design a GEMM-based data-pipeline architecture. Experimental results show that, for a single convolutional layer, MDP decreases DSP resource utilization by 0.78× while maintaining the same level of parallelism. Compared to UltraNet, MDP achieves a 1.56× latency improvement.
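The core idea of packing multiple low-precision multiplications into one wide DSP multiplication can be illustrated with a minimal sketch. This is a generic example of the technique, not the paper's specific MDP packing scheme: it assumes unsigned 4-bit operands and a hypothetical 8-bit guard band so that two partial products never overlap inside the wide result.

```python
# Sketch: two 4-bit x 4-bit multiplications computed with ONE wide multiply,
# the same principle used to pack several products into a 27x18-bit DSP.
# Hypothetical parameters: unsigned 4-bit values, 8-bit guard band.

SHIFT = 8  # each 4x4-bit product is at most 15*15 = 225 < 2**8, so 8 bits suffice

def packed_mul(w1: int, w2: int, a: int) -> tuple:
    """Return (w1*a, w2*a) using a single wide multiplication."""
    assert 0 <= w1 < 16 and 0 <= w2 < 16 and 0 <= a < 16
    packed = (w1 << SHIFT) | w2      # 12-bit operand, fits a 27-bit DSP port
    product = packed * a             # one multiply; hardware would use one DSP
    return product >> SHIFT, product & ((1 << SHIFT) - 1)

# Example: both products recovered exactly from the single wide result.
print(packed_mul(7, 13, 9))  # (63, 117)
```

The guard band is what makes the mode "universal" in spirit: for other precisions, the shift is chosen so each partial product stays within its own bit field, and signed operands additionally require sign-correction of the upper field.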