Implementations of the convolution operation in neural networks are usually based on the convolution-to-GeMM (General Matrix Multiplication) transformation. However, this transformation requires a large intermediate buffer (called im2col or im2row), and initializing it is both memory- and time-consuming. To overcome this problem, one may use the Indirect Convolution Algorithm, which replaces the im2row buffer with a much smaller buffer of pointers, called the indirection buffer. However, it limits the choice of the multiplication micro-kernel, making the matrix multiplication slightly less efficient than in the classical GeMM algorithm. To overcome this problem, we propose the Almost Indirect Convolution Algorithm, which initializes the small, specifically ordered block of values used in matrix multiplication via the indirection buffer, in the same way the GeMM algorithm initializes such a block from the im2row buffer. Our approach combines the computational efficiency and flexibility in the shape of GeMM micro-kernels with the small memory footprint of the Indirect Convolution Algorithm. Experiments with convolutions on 8-bit matrices on ARM processors show that our convolution runs 14-24% faster than the Indirect Convolution Algorithm for a small number of channels and 10-20% faster than the classical GeMM-based convolution. This makes it well suited for computing the inference of 8-bit quantized networks on mobile devices.
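To make the packing step concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how a block of 8-bit input values could be gathered through an indirection buffer of row pointers just before a GeMM micro-kernel consumes it, instead of being copied from a fully materialized im2row buffer. The function name, panel shape, and data layout are assumptions made for illustration only.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch: pack a (kc x nr) panel of 8-bit values for one
 * micro-kernel call. indirection[i] points at the input pixel (a run of
 * `channels` contiguous values) that the corresponding im2row row would
 * have copied; here it is dereferenced lazily, block by block, so no
 * full im2row buffer is ever materialized. */
static void pack_block_via_indirection(const int8_t *const *indirection,
                                       size_t kc, size_t nr, size_t channels,
                                       int8_t *packed /* kc * nr * channels */)
{
    for (size_t k = 0; k < kc; ++k) {          /* rows of the panel        */
        for (size_t n = 0; n < nr; ++n) {      /* micro-kernel columns     */
            const int8_t *src = indirection[k * nr + n];
            for (size_t c = 0; c < channels; ++c)
                *packed++ = src[c];            /* gather into GeMM order   */
        }
    }
}
```

Because the packed panel is small and reordered exactly as a GeMM micro-kernel expects, any micro-kernel shape usable with a classical im2row-based GeMM can be reused here, which is the flexibility the abstract refers to.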