Vandooren, Laura
[UCL]
Bol, David
[UCL]
De Vleeschouwer, Christophe
[UCL]
Keyword spotting (KWS) systems are a major part of human-technology interfaces. They require low-latency, high-accuracy responses for a good user experience. Moreover, KWS typically runs on small microcontrollers and therefore has limited memory and compute capability. To meet these constraints, recurrent neural networks (LSTM and Grid LSTM models) and a convolutional neural network (CNN) are trained to determine whether a one-second audio clip contains one of four predefined words or an unknown word. This comparison identifies the model with the best trade-off between memory, accuracy and computational complexity. The CNN reaches the best accuracy, nearly 93%, with 370 million operations and only 500 thousand parameters. For resource-constrained problems, the 1-layer LSTM is optimal: it uses the least memory and the fewest operations, but achieves only 86.6% accuracy. We also studied the effect of signal pre-processing on network performance and found that using an optimal window size of 25 ms for the spectrogram computation raises the accuracy by 3%. Finally, the LSTM network is quantized to reduce the model's memory size and computational complexity. The weights and biases can be reduced to 4-bit fixed-point numbers if a 3% accuracy drop relative to the floating-point implementation is tolerated.
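
As an illustration of the quantization step mentioned above, the following is a minimal sketch of rounding weights to signed fixed-point values. The abstract does not describe the exact scheme used in the thesis, so the function name `quantize_fixed_point`, the split between integer and fractional bits, and the round-to-nearest rule with saturation are all assumptions made for this example.

```python
import numpy as np

def quantize_fixed_point(w, total_bits=4, frac_bits=2):
    """Round values to a signed fixed-point grid (hypothetical helper).

    total_bits: word length including the sign bit (assumed layout).
    frac_bits:  number of fractional bits, so the step size is 2**-frac_bits.
    """
    scale = 2.0 ** frac_bits
    q_min = -(2 ** (total_bits - 1))      # most negative integer code
    q_max = 2 ** (total_bits - 1) - 1     # most positive integer code
    codes = np.round(np.asarray(w, dtype=np.float64) * scale)
    codes = np.clip(codes, q_min, q_max)  # saturate out-of-range values
    return codes / scale                  # back to real-valued weights

weights = np.array([0.3, -0.6, 5.0, -5.0])
print(quantize_fixed_point(weights, total_bits=4, frac_bits=2))
```

With 4 bits, each weight can take at most 16 distinct values, which is why some accuracy loss relative to floating point is to be expected.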


Bibliographic reference
Vandooren, Laura. Comparison between convolutional and recurrent neural networks for keyword recognition applications. Ecole polytechnique de Louvain, Université catholique de Louvain, 2020. Prom. : Bol, David ; De Vleeschouwer, Christophe.
Permanent URL
http://hdl.handle.net/2078.1/thesis:25145